
Chapter 1 Introduction To Statistics PDF


CHAPTER 1

INTRODUCTION TO STATISTICS

Expected Outcomes
• Able to define the basic terminology of statistics.
• Able to apply the basic steps in the statistical problem-solving methodology for various applications.
• Able to summarise and analyse data using measures of central tendency, measures of variation and measures of position.
• Able to relate the concepts of accuracy and precision of data using a game of darts.
• Able to conduct exploratory data analysis that includes numerical data analysis and various graphical displays.
• Able to plot and interpret a normal probability plot.

SZS2017
CONTENT
1.1 Statistical Terminologies
1.2 Statistical Problem Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.2.1 Accuracy and Precision
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel
1.4 Exploratory Data Analysis
1.4.1 Outliers
1.4.2 Box Plot
1.5 Normal Probability Plot

1.1 STATISTICAL TERMINOLOGIES
• Define the meaning of statistics, population, sample, parameter, statistic, descriptive statistics and inferential statistics.
• Discuss the importance of statistics in daily life.

1.1.1 What is Statistics?
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
• Ten thousand parents in Malaysia have chosen StemLife as their trusted stem cell bank.
• The death rate from lung cancer was 10 times higher for smokers compared to nonsmokers.
• The average cost of a wedding is nearly RM10,000 in Malaysia.
• In Malaysia, the median salary for men with a bachelor’s degree is RM 30,000 per year, while the median salary for women with a bachelor’s degree is RM 29,000 per year.
• Globally, an estimated 500,000 children under the age of 15 live with Type 1 diabetes.
• Women who eat fish once a week are 29% less likely to develop heart disease.

What is Statistics?
• The science of conducting studies to collect, organise, summarise, analyse, present, interpret and draw conclusions from data.
• Data: any values (observations or measurements) that have been collected.
• The collection and analysis of data are the most important parts of research methodology.
• Researchers must have a basic knowledge of statistics before starting any research or study involving data analysis.
• Statistics is also used to analyse the results of surveys and as a tool in scientific research to make decisions based on controlled experiments, estimation, prediction, and quality control.

1.1.2 Why Do We Need Statistics?
• Basic knowledge of statistics is needed in any discipline or field of research or study (in almost all fields of human endeavour) that involves data analysis.
• The methods of statistics allow researchers to design a valid experiment and finally draw a reliable conclusion or interpretation from the data they produced and analysed.
Examples:
• In sports, a statistician may keep records of the number of successful kicks a team scored during a football season.
• In public health, a doctor might be concerned with the number of children who are infected with the H1N1 virus during a certain year.
• In education, an educator might want to know if the performance of students in the current semester is better than in the previous semester.
Knowledge of statistics may help you in:

1. Describing the relationship between variables.

a. A university admission director needs to find an effective way of selecting students. He designed a statistical study to see if there is a significant relationship between SPM results and the GPA achieved by first-year students at his university. If there is a strong relationship, a high SPM result will become an important criterion for admission.
b. A management consultant wants to compare a client’s investment return for this year with related figures from last year. He summarises the revenue and cost data from both periods and finds the relationship between these two variables. Based on his findings, he presents his recommendations to his client.

A variable is a characteristic or attribute that can assume different values. These values are data. A variable is called a random variable if its values are determined by chance.
2. Making better decisions in the face of uncertainty.

a. Suppose that the manager of Unisex Hair Stylist claimed that 90% of the customers are satisfied with the services. If a consumer activist feels that this is an exaggerated statement that might require legal action, the activist can use statistical inference techniques to decide whether or not to sue the manager. Therefore, the knowledge gained from studying statistics can help people become better consumers.
b. People can make intelligent decisions about what products to purchase based on consumer studies, about government spending based on utilisation studies, and so on.

1.1.3 Population and Sample
Population (N)
A complete collection of measurements, outcomes, objects or individuals under study.
• Tangible: finite, and the total number of subjects is fixed and could be listed.
  → all computers in a room, all female students in a university, or all electrical components manufactured in a day, etc.
• Conceptual (Intangible): all values that might possibly have been observed; has an unlimited number of subjects.
  → simulated data from a computer or instrument, number of germs on the human body, all experimental data such as all measurements of the length of a metal rod, etc.

Sample (n)
A subset of the population that is observed.
Parameter and Statistic
Parameter: a numerical value that represents a certain population characteristic.
• The average weight of students from a population of students in a university
• The percentage of defective components in a population of electrical components manufactured in a day

Statistic: a numerical value that represents a certain sample characteristic.
• The average weight for a sample of female students selected from all students in a university
• The percentage of defective components in a sample of 100 electrical components

Measurement          Parameter   Statistic
Mean (Average)       μ           x̄
Variance             σ²          s²
Standard deviation   σ           s
Proportion           π           p

EXAMPLE 1.1
A travel agent claims that the average number of rooms in large hotels in
Pahang is 500 and the standard deviation is 165. A sample of seven hotels in
Genting Highlands is selected and the average number of rooms is found to be
435 with standard deviation of 15.
Based on the above example:
• The population under study is all large hotels in Pahang.
• The sample selected is seven large hotels in Genting Highlands.
• The population under study is tangible since there is a finite number of large hotels in Pahang.
• The characteristic (variable) is the number of rooms.
• The parameters are μ = 500 and σ = 165 since they describe the population characteristics.
• The statistics are x̄ = 435 and s = 15 since they describe the sample characteristics.

EXERCISE 1.1.3
The number of first year students at a residential college is 317 students. An IQ
pre-test is given to all of them in their first week. The dean of admission
collected data on 27 of them and found their mean score on the IQ pre-test was
51. The mean for the entire first year students was estimated to be
approximately 51. A subsequent computer analysis of all first year students
showed that the true mean (population mean) is 52.
Based on the above statement, answer the following questions.

a) What is the population?
b) Is the population tangible or conceptual?
c) What is the sample?
d) What is the variable of the study?
e) Which number describes a parameter?
f) Which number describes a statistic?

Answers: a) 317 first year students b) tangible c) 27 first year students d) IQ pre-test score e) 52 f) 51

1.1.4 Descriptive and Inferential Statistics
Descriptive statistics
• Includes the process of data collection, data organisation, data classification, data summarisation, and data presentation obtained from the sample.
• Used to describe the characteristics of the sample.
• Used to determine whether the sample represents the target population by comparing the sample statistic and the population parameter.
EXAMPLE: Ten thousand parents in Malaysia have chosen Takaful Insurance as their trusted life insurance agency.

Inferential statistics
• Involves a process of generalisation, estimation, hypothesis testing, prediction and determination of relationships between variables.
• Used to describe, infer, estimate or approximate the characteristics of the target population.
• Used when we want to draw a conclusion from the data obtained from the sample.
EXAMPLE: The death rate from lung cancer was 10 times higher for smokers compared to nonsmokers.

[Figure: Overview of descriptive and inferential statistics]

EXERCISE 1.1.4

In the statements below, decide whether each statement describes descriptive statistics or inferential statistics.

a) The average cost of a wedding is nearly RM10,000.
b) In Malaysia, the median salary for men with a bachelor’s degree is RM 30,000 per year, while the median salary for women with a bachelor’s degree is RM 29,000 per year.
c) Globally, an estimated 500,000 children under the age of 15 live with Type 1 diabetes.
d) A researcher claims that a new drug will reduce the number of heart attacks in men over 70 years of age.

1.1.5 Role of the Computer in Statistics

Two software tools commonly used for data analysis:

1. Spreadsheets
 Microsoft Excel & Lotus 1-2-3

2. Statistical Packages
 AMOS, eViews, MINITAB, R, SAS, SmartPLS,
SPSS and SPlus

Data Analysis Application Tools in EXCEL

1. Graph and chart

2. Formulas

3. Data Analysis Tools:
File → Options → Add-Ins → Analysis ToolPak → OK → Data → Data Analysis

• Choose Analysis ToolPak and click Go.
• Tick Analysis ToolPak and click OK.
→ Now we can use the Data Analysis tools in Microsoft Excel to analyse data.

1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
• Outline the six basic steps in the statistical problem-solving methodology.
• Identify various sampling methods.
• Classify the type of data and level of measurement.

Statistical Problem-Solving
Methodology

1.2.1 Identify the Problem or Opportunity
The researchers must clearly understand and define the objective of the study before conducting any research. Possible questions that could be asked before starting any study are given as follows.

• What are the problem and the objective of the study?
• What are the possible variables that are related to the study?
• Can the study goal be achieved through simple counts or measurements of the group?
• What possible treatments should be imposed on the group, and what are their responses?
• Should the experiment be performed on the group?
• Do the data come from a population or a sample?
• If samples are needed, what sample size is appropriate? How should they be taken?
Characteristics of Sample
• A sample is a subset of a population.
• The population is a complete group of people, companies, hospitals, stores, universities, students, etc., that share some set of characteristics.
• A census involves the whole population, which possesses a greater likelihood of non-sampling errors.
• Sampling error occurs when the statistical characteristics of a population are estimated from a subset, or sample, of that population. The difference between the sample and population values is considered a sampling error.
• A non-sampling error is an error that is not due to sampling. For example, in a survey, mistakes may occur in the selection of people.

Characteristics of Sample Size
• The larger the sample size, the smaller the magnitude of sampling errors would be.
• Studies using the survey method need a larger sample size since participation in a survey is voluntary.
• Studies using mail response need a much larger sample size. Normally, the response rate is as low as 20%-30%.
• The ideal sample size in a study should be large enough to serve as an adequate representation of the population, so that the results can be generalised to the overall population.
• The optimal sample size depends on the statistical distribution used and on the purpose of generalising to the whole population.
• Researchers may refer to Krejcie and Morgan (1970) as a guideline to obtain an adequate sample size.
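As a rough illustration of the Krejcie and Morgan (1970) guideline, the sketch below evaluates the sample-size formula usually attributed to that paper, s = χ²NP(1−P) / (d²(N−1) + χ²P(1−P)). The formula and its default constants (χ² = 3.841, P = 0.5, d = 0.05) are not given in these slides; they are assumed here from the cited paper, and the function name is hypothetical.

```python
def krejcie_morgan_sample_size(population_size, chi_square=3.841,
                               proportion=0.5, margin=0.05):
    """Suggested sample size for a finite population of the given size.

    Assumed defaults (not from the slides): chi-square for 1 degree of
    freedom at 95% confidence, P = 0.5 and a 5% margin of error.
    """
    spread = chi_square * proportion * (1 - proportion)
    numerator = spread * population_size
    denominator = margin ** 2 * (population_size - 1) + spread
    return round(numerator / denominator)


# Suggested sample sizes for a few population sizes
for n_population in (100, 1000, 5000):
    print(n_population, krejcie_morgan_sample_size(n_population))
```

The suggested sample size grows much more slowly than the population, so doubling the population does not require doubling the sample.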

1.2.2 Deciding on the Method of Data Collection
• Data must be as complete, accurate and relevant to the problem as possible in order to solve the problem.

• Data could be obtained in 3 ways:

1) Data that are made available by others (internal, external, primary or secondary data)
• It is similar to historical or observed data.
• The availability of the data depends on the primary and secondary sources of documents and evidence, including interviews, observation, minutes of meetings, formal policy statements, etc.
• Example: Rainfall data collected from the Malaysian Meteorological Department are secondary data.
2) Data resulting from an experiment (experimental study):
• In an experimental study, the researcher manipulates one of the variables and studies how the manipulation influences other variables, provided that the treatments and the subjects are assigned to groups randomly.
• Example: Blood glucose level data obtained from diabetic patients before and after a treatment are an example of experimental data.

3) Data collected in an observational study (observation, survey, questionnaire):
• Observations vs. interviews
Observation method
• In qualitative research: used to study behaviours or events, the context that surrounds them, and the relationship between the behaviour and the event.
• In quantitative research: used to collect data regarding the number of occurrences in a specific period of time, or the duration of a very specific behaviour or event.
• The detailed descriptions or data collected in qualitative research can be converted later to numerical data and analysed quantitatively.
• The observation method can be used to study the physical environment, social interactions, physical activities, non-verbal communication, and planned and unplanned activities.
• Example: A study on customers’ behaviour towards types of brands in a certain shopping complex is an example of an observational study.

Interview method
• The purpose of an interview in collecting data is to find out what is in or on someone else’s mind.
• Interview data can easily become biased and misleading if the interviewed person is aware of the perspective of the interviewer.
• It is very important to make sure the person being interviewed does not hold any preconceived notions regarding the outcome of the study.
• Interviews range from quite informal and completely open-ended to very formal, with the questions predetermined and asked in a standard manner.
• Usually, interviews are used to gather information regarding an individual’s experience and knowledge; his/her opinions, beliefs and feelings; and demographic data.
• Example: An interviewer is interested in gathering information on the way nurses organise their care in hospital wards and conducts an interview session.

Other Methods of Data Collection
• Questionnaires and surveys (Quantitative + Qualitative).
• Opinions (Qualitative + Quantitative).
• Projective techniques and psychological tests (both).
• Proxemics – Study of people’s use of space and its relationship to culture.
• Kinesics – Study of body movement, or how people communicate nonverbally.
• Street Ethnography – Concentrates on a person becoming a part of the place under study.
• Narratives – Study of people’s individual life stories.
• Triangulation – The use of multiple data collection techniques (triangulation of data permits the verification and validation of qualitative data).

EXERCISE 1.2.2
Identify each of the following studies as being either observational or
experimental.

a) Subjects were randomly assigned to two groups, and one group was
given a herb and the other group a placebo. After 6 months, the
numbers of respiratory tract infections in each group were compared.
b) A researcher stood at a busy intersection to see if the colour of an
automobile a person drives is related to running red lights or not.
c) A researcher finds that people who are more hostile have higher
total cholesterol levels than those who are less hostile.
d) Subjects are randomly assigned to four groups. Each group is
placed on one of four special diets—a low-fat diet, a high-fish diet, a
combination of low-fat diet and high-fish diet, and a regular diet.
After 6 months, the blood pressures of the groups are compared to
see if diet has any effect on blood pressure or not.
1.2.3 Collecting the Data (Sampling Techniques)
• Sampling is the process of selecting a few samples from a population to become the basis for estimating or predicting the prevalence of an unknown piece of information, situation or outcome regarding the bigger group.
i. Non-probability sampling (judgment, voluntary, convenience):
• The sample is collected based on the judgment of the experimenter.
• Resulting samples might be biased.
ii. Probability sampling (random, systematic, stratified, cluster):
• The chance of each item being selected is known before the sample is picked.
• Resulting samples are unbiased.

• Each data set collected from a sampling process can be classified as either non-probability data or probability data.
Sampling Techniques
• Nonprobability sampling: Judgment, Voluntary, Convenience, Others (Snowball, Quota)
• Probability sampling: Random, Systematic, Cluster, Stratified, Others (Multi-stage, K-Sampling, Nested)

A) Nonprobability Sampling Methods

Judgment sampling
Method: Data is selected based on the opinion of one or more experts.
Example: A political campaign manager intuitively picks certain voting districts as reliable places to measure the public opinion of his candidates.

Voluntary sampling
Method: Questions are posed to the public by publishing them over radio or television, with responses via phone, short message, email, etc. The resulting sample tends to over-represent individuals who have strong opinions.
Example: A call-in radio show asks its listeners to participate in surveys on controversial topics such as abortion, affirmative action, gun control, politics, etc.

Convenience sampling
Method: The data selected is an “easy sample” (haphazard or accidental sampling). The researcher obtains units or people who are most conveniently available.
Example: A surveyor will stand in one location and ask passersby the questions.

B) Probability Sampling Methods
1. Random sampling
• Each data item is numbered, and then the data are selected using a chance or random method such as random numbers.
• When a sample is chosen at random, it is said to be an unbiased sample.
• A random sample can be selected with or without replacement.

Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her university. There are 5000 students enrolled at the university, and he/she wants to draw a sample of size 100 to take a physical fitness test.
The lecturer could obtain a list of all 5000 students, number it from 1 to 5000 and then randomly invite the 100 students corresponding to the selected numbers to participate in the study.
Generating Random Numbers
• Generating random numbers is an important step in obtaining a random sample.
• In a set of random numbers, each number has an equal chance of being selected.
• Random numbers can be generated from a calculator, software, or a random number table.
• For example, suppose we have data numbered from 1 to 100 and we want to choose only five of them. In R, we can use the command “sample(1:100, 5)”. The resulting output is five numbers listed in random order.
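The same idea as the R command can be sketched in Python using only the standard library. The function name and the seed value are illustrative assumptions; `random.sample` draws without replacement, so no number is picked twice.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct labels from 1..population_size."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(range(1, population_size + 1), sample_size)


# Five numbers chosen at random from data numbered 1 to 100
chosen = simple_random_sample(100, 5, seed=2017)
print(chosen)
```

Calling `simple_random_sample(5000, 100)` would give the 100 student numbers for the lecturer's fitness study in the random sampling example.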

2. Systematic sampling
• A set of data is numbered from 1 to N: x1, x2, …, xN.
• The first data item is selected randomly between 1 and k, where k = N/n and n is the sample size.
• The next items are then selected at every k-th interval to produce n samples.

Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her university and he/she wants to draw a sample of size 100 to take a physical fitness test. The lecturer obtains a list of all 5000 students, numbers it from 1 to 5000 and randomly picks one of the first 50 students (k = 5000/100 = 50) on the list. If the first picked number is 30, then the 30th student on the list should be invited first. The lecturer should then invite every 50th name on the list after this random start (the 80th student, the 130th student and so on) to produce a sample of 100 students to participate in the study.
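The systematic procedure in this example can be sketched as follows. This is a minimal illustration that assumes N is an exact multiple of n; the function name is hypothetical, and the start value 30 is fixed only to reproduce the numbers used in the example.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Pick every k-th label after a random start, where k = N // n."""
    k = population_size // sample_size
    if start is None:
        # Random start between 1 and k (both ends included)
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]


# 5000 students, sample of 100, first picked number is 30
student_numbers = systematic_sample(5000, 100, start=30)
print(student_numbers[:3], student_numbers[-1])
```

Leaving `start` unset lets the function choose the random start itself, which is how the method would normally be used.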
3. Stratified sampling
• The population is divided into groups (strata) according to some characteristic that is important to the study, and then the sample is selected from each group using random or systematic sampling.
• The characteristics are homogeneous (similar) within each group but heterogeneous (dissimilar) among the groups.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different between male and female students. To account for this variation in lifestyle, the population of students can easily be stratified into male and female students.
The random method or the systematic method can be used to select the participants. For example, the lecturer can use random sampling to choose 50 male students and systematic sampling to choose another 50 female students, or vice versa.
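A minimal sketch of stratified sampling for this example is given below, using simple random sampling inside every stratum. The split of 2400 male and 2600 female students and the ID labels are made-up values for illustration only.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of the requested size from each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, per_stratum[name])
        for name, members in strata.items()
    }


# Hypothetical student lists, stratified by gender
strata = {
    "male": [f"M{i}" for i in range(1, 2401)],
    "female": [f"F{i}" for i in range(1, 2601)],
}
participants = stratified_sample(strata, {"male": 50, "female": 50}, seed=1)
print(len(participants["male"]), len(participants["female"]))
```

Because each stratum is sampled separately, both groups are guaranteed to appear in the final sample in the chosen numbers.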
4. Cluster sampling
• The population is divided into groups or clusters, then some of those clusters are randomly selected and all members of the selected clusters are chosen.
• Cluster sampling can reduce cost and time.
• The members within each cluster are heterogeneous, while the clusters themselves are homogeneous (similar to one another).
• We can choose more than one cluster.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different between 1st year, 2nd year, 3rd year and senior students. To account for this variation in lifestyle, the population of students can easily be clustered into four categories.
Then, the lecturer can randomly choose any of the clusters and select all students in those clusters as the participants. For example, all 2nd year students are chosen as the participants.
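Cluster sampling for this example can be sketched as below: whole clusters are drawn at random and every member of a drawn cluster takes part. The cluster sizes and student labels are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters clusters and return all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members


# Hypothetical clusters of students by year of study
clusters = {
    "1st year": [f"Y1-{i}" for i in range(1, 1401)],
    "2nd year": [f"Y2-{i}" for i in range(1, 1301)],
    "3rd year": [f"Y3-{i}" for i in range(1, 1201)],
    "senior": [f"Y4-{i}" for i in range(1, 1101)],
}
chosen_clusters, participants = cluster_sample(clusters, 1, seed=7)
print(chosen_clusters, len(participants))
```

Unlike stratified sampling, no sampling happens inside a cluster; the randomness lies only in which clusters are selected.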

Advantages and Disadvantages of Each Sampling Technique

Judgement Sampling
When to use: When the population is too large.
Advantages:
- Fast and conclusive.
Disadvantages:
- Biased since it is based on the opinion of one or more experts only.

Voluntary Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast response.
- Easy to obtain larger sample sizes.
Disadvantages:
- Samplings are too random.
- Sometimes not reliable.
- Degree of generalisability is questionable.

Convenience Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast and easy.
- Convenient and inexpensive.
Disadvantages:
- Samplings are too random.
- Sometimes not reliable.
- Degree of generalisability is questionable.

Random Sampling
When to use: When the members of the population are similar to one another on important variables.
Advantages:
- Uses a table of random numbers.
- Each data item has an equal chance to be selected.
- Ensures a high degree of representativeness.
Disadvantages:
- High cost.
- Time consuming for large sample sizes.
- Tedious.

Systematic Sampling
When to use: When the members of the population are similar to one another on important variables.
Advantages:
- Relatively easy to construct, execute, compare and understand.
- The process can be controlled.
- Good for tight-budget research.
- Ensures a high degree of representativeness.
- No need to use a table of random numbers.
Disadvantages:
- There is a risk of data manipulation.
- Not the best method if the researcher does not know the background of the population.
- Less random than simple random sampling.

Stratified Sampling
When to use: When the population is heterogeneous and contains several different groups, some of which are related to the topic of the study.
Advantages:
- Variety of samples.
- Ensures a high degree of representativeness of all the strata or layers in the population.
Disadvantages:
- Time consuming.
- Tedious.

Cluster Sampling
When to use: When the population consists of units rather than individuals.
Advantages:
- Less energy and money.
- Easy and convenient.
- Saves time.
Disadvantages:
- Possibly, members of units are different from one another, decreasing the technique’s effectiveness.

Random Data Generation From Normal Distribution
X ~ N(μ, σ²) or Z ~ N(0, 1), where μ is the mean and σ² is the variance.

Random Data Generation From Poisson Distribution
X ~ Po(λ), where λ is the average value.
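As one possible way to generate such data outside a spreadsheet, the sketch below draws normal and Poisson values with Python's standard library. The parameter values (μ = 50, σ = 5, λ = 4), the seed and the function names are illustrative assumptions. Note that `random.gauss` takes the standard deviation σ, not the variance σ², and the Poisson draw uses Knuth's multiplication method because the standard library has no built-in Poisson generator.

```python
import math
import random


def normal_sample(mu, sigma, n, rng):
    """n draws from N(mu, sigma^2); sigma is the standard deviation."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def poisson_variate(lam, rng):
    """One draw from Po(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


rng = random.Random(2017)  # seeded only so that reruns give the same data
normal_data = normal_sample(50, 5, 10000, rng)
poisson_data = [poisson_variate(4, rng) for _ in range(10000)]

print(sum(normal_data) / len(normal_data))    # should be close to mu = 50
print(sum(poisson_data) / len(poisson_data))  # should be close to lambda = 4
```

With large samples the printed averages should land near μ and λ respectively, which is a quick sanity check on generated data.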

EXERCISE 1.2.3
In each of these statements, identify the type of sampling method used.

a) Suppose a researcher has a list of 1000 registered voters in a community and he wants to pick a probability sample of 50 voters. He uses a random number table to pick one of the first 20 voters (1000/50 = 20) on the list. The table gave him the number 16, so he selects the 16th voter on the list as the first selected number. Then he picks every 20th name after the first random start (the 36th voter, the 56th voter, etc.) until 50 samples are obtained.

b) In a consumer survey of large cities, a researcher divides a map of the city into small blocks. Each block containing a cluster is surveyed. A number of clusters are selected for the sample, and all the households in a cluster are surveyed. Less energy and money are expended if an interviewer stays within a specific area rather than traveling across stretches of the cities.


c) Researchers or farm managers may be called in when a crop shows a certain growing pattern or when surface differences are observed for a soil. For example, differences may occur in soil colour, which may be the result of many factors. A researcher is called to judge a particular shade of colour to be typical for a sample at certain sites. Then from these sites, samples are drawn.
d) The population of university professors is divided into groups according to
their rank (instructor, assistant professor, etc.) and several are selected from
each group to make up a sample.
e) A surveyor stands outside a shop in the East Coast Mall and randomly selects
people to participate in a quiz.
f) A quality engineer wants to inspect rolls of wallpaper in order to obtain
information on the rate at which flaws in the printing are occurring. She
decides to draw a sample of 50 rolls of wallpaper from a day’s production. At
the end of each hour, for 5 consecutive hours, she takes the 10 most
recently produced rolls and counts the number of flaws on each.
MIND EXPANDING EXERCISES
1. Statistics can be applied across many disciplines and fields of research, and in almost all fields of human endeavour. Based on this statement, suggest reasons why statistics is important.

2. Is a large sample necessarily a good sample? Why or why not?

3. Suppose you have been hired by a radio station in Malaysia to determine the age distribution of their listeners. Describe in detail how you would select a sample of at least 3000 listeners. Choose the best sampling techniques and state the reason. The sampling techniques can be mixed or combined.

1.2.4 Classifying and Summarising the Data
• In this step, the collected data are organised properly for further study and investigation.
• Data that have been collected during the sampling process are called raw data.
• The simplest way to organise raw data systematically is by using a data array. A data array is an arrangement of data items in either ascending or descending order (sorting).
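A data array is easy to produce in software. The short sketch below sorts a small set of made-up raw observations in both directions; the values are hypothetical and only illustrate the idea.

```python
# Hypothetical raw data, in the order the observations were collected
raw_data = [23, 17, 31, 17, 28, 20]

# Data arrays: the same items arranged in ascending and descending order
ascending = sorted(raw_data)
descending = sorted(raw_data, reverse=True)

print(ascending)
print(descending)
```

Sorting leaves the original `raw_data` list unchanged, so the collection order remains available if it is needed later.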

1.2.4.1 Classifying
• Identifying items with the same characteristics and arranging them into groups or classes.
• Data could be classified by type or by level of measurement.
1.2.4.2 Summarisation
• Graphical and descriptive statistics (tables, charts, measures of central tendency, measures of variation, measures of position).
Example of Raw Data
Data can be organised by column or row.

1.2.4.1 Data Classification

• Data are the values that variables can assume.
• A variable is a characteristic or attribute that can assume different values.
• Variables whose values are determined by chance are called random variables.

Data can be classified:
• As quantitative or qualitative type
• By how they are categorised, counted or measured (level of measurement of data)

Type of Data

Qualitative (Categorical/Attributes)
• Data that refers to a classification name according to some characteristic or attribute.
• Data is classified using code numbers (1, 2, …).
- Nominal Data: The values cannot be ranked. Examples: gender, race, citizenship, colour, etc.
- Ordinal Data: The values can be ranked and a Likert scale is used. Examples: feeling (dislike-like), colour (dark-bright), etc.

Quantitative (Numerical)
• Data can be counted or measured.
• Data can be ordered or ranked.
- Discrete Data: The values can be counted and are finite. Examples: number of students, number of cats, number of defects, etc.
- Continuous Data: The values can be placed within two specified values, are obtained by measuring, have boundaries, and are rounded to the required decimal places. Examples: weight, age, salary, temperature, etc.
Levels of Measurement of Data

Nominal-level
Description: Classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Examples: Zip code (4, 5, 6, …), Post code (25000, 25600, …), Gender (female, male), Eye colour (blue, brown, green, hazel), Political affiliation, Religion, Nationality

Ordinal-level
Description: Classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Examples: Grade (A, B, C, D, etc.), Judging (first place, second place, etc.), Rating scale (poor, good, excellent), Colour (light blue, …, dark blue)

Interval-level
Description: Ranks the data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Examples: IQ test, Temperature, Shoe size

Ratio-level
Description: Possesses all the characteristics of interval measurement, and there exists a true zero.
Examples: Height, Weight, Time, Salary
EXERCISE 1.2.4.1
1. The SuperMotor Marketing Corporation has asked you for information about the car you drive. For each question, identify the type of data requested as either attribute data or numeric data. When attribute data is requested, identify the variable as either nominal or ordinal. When numeric data is requested, identify the variable as either discrete or continuous. Then, identify the level of measurement for each variable.

a) What is the weight of your car?
b) In what city was your car made?
c) How many people can be seated in your car?
d) What is the distance traveled from your home to your school?
e) What is the colour of your car?
f) How many cars are in your household?
g) What is the length of your car?
h) What is the normal operating temperature (in °C) of your car’s engine?
i) What petrol mileage (km/l) do you get in city driving?
j) Who made your car?
k) How many cylinders are there in your car’s engine?
l) How many kilometres have you put on your car’s current set of tyres?

EXERCISE 1.2.4.1
2. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.

Type of transportation industry    Number of job-related injuries
Railroad                           4520
Intercity bus                      5100
Subway                             6850
Trucking                           7144
Airline                            9950

a) What are the variables under study?
b) Categorise each variable either as qualitative or quantitative.
c) Categorise each quantitative variable either as discrete or
continuous.
d) Categorise each qualitative variable either as nominal or ordinal.
e) Identify the level of measurement for each variable.

1.2.4.2 Data Summarisation
a) Descriptive Statistics (refer Section 1.3)
 Typically used to confirm conjectures about the data.
 Quantitative data: measures of central tendency, measures of
variation (dispersion) and measures of position.
 Qualitative data (non-numeric quality (attribute) or category):
measure the relative frequency for a particular characteristic
and calculate its percentage.

b) Graphical Summary
 Organise the data in some meaningful way by constructing a
frequency distribution (refer Appendix A.1) for quantitative or
qualitative data.
 A frequency distribution is the organisation of raw data in
table form, using classes and frequencies.
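As a rough illustration of summarising qualitative data, the sketch below counts each category and converts the counts to percentages. The helper name `frequency_table` and the eye-colour responses are hypothetical, chosen only for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, frequency, relative frequency in %) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, freq, 100 * freq / total) for cat, freq in counts.most_common()]

# Hypothetical eye-colour responses, used only for illustration.
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue", "hazel", "brown"]

table = frequency_table(eye_colours)
for cat, freq, pct in table:
    print(f"{cat:<6} {freq:>2} {pct:5.1f}%")
```

For quantitative data the same idea applies after the values have been grouped into classes.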
Graphical Statistics
The purpose of graphs in statistics is to convey the data to the viewer in pictorial
form and to get the audience’s attention in a publication or a presentation.

[Figure: examples of a histogram, frequency polygon, ogive, bar chart, Pareto chart, pie chart and time series graph]
Histogram, Frequency Polygon, Ogive

Histogram
 For quantitative data.
 Describes a grouped frequency data distribution.
 Displays the data by using contiguous vertical bars of various heights to represent the frequency of the classes.

Frequency Polygon
 For quantitative data.
 Describes a grouped frequency data distribution.
 Displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes.
 The frequencies are represented by the heights of the points.

Ogive
 For quantitative data.
 Represents the cumulative frequencies for the classes in a grouped frequency data distribution.
 Visually represents how many values are below a certain upper class boundary.
Distribution Shapes for Histogram
[Figure: histogram shapes: bell-shaped, uniform, J-shaped, reverse J-shaped, right skewed, left skewed, bimodal, U-shaped]
Bar Chart, Pareto Chart, Pie Chart

Bar Chart
 For quantitative data, the bars represent the mean values.
 For qualitative data, the heights or lengths of the bars represent the frequencies of the data.
 The bars can be vertical or horizontal.

Pareto Chart
 Used to represent a frequency distribution for a categorical variable.
 The frequencies are displayed by the heights of vertical bars, which are arranged in decreasing order.

Pie Chart
 A circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
 Pie charts show the relationship between classes in a set of data and the whole data.
Stem and Leaf Plot, Time Series Graph

Time Series Graph
 Represents data that occur over a specific period of time.
 For analysis, we look at the trend or pattern (increasing or decreasing) that occurs over the time period.
 Further analysis will look at the slope or the steepness of the line (rapid increase or decrease).

Stem and Leaf Plot
 The leading digit is plotted as the stem and the trailing digit as the leaf to form groups or classes.
 A key indicator is used to define the stem and leaf values.
 If the plot is rotated to a horizontal position, we can see the shape of the data distribution.
 For a mixture stem and leaf plot, the shape of distribution for the left side may be seen by reflecting the plot to the right side.
 We may analyse the variability of the data by looking at the spread of the stem and leaf plot.
 A stem and leaf plot is also good at showing the range, minimum, maximum, mode, gaps, clusters, and outliers.
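The stem and leaf idea can be sketched in a few lines of code. The version below is a minimal sketch that assumes non-negative whole numbers, using the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the values are illustrative only.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem) and units digit (leaf)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)

# Illustrative values only.
values = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]

plot = stem_and_leaf(values)
print("Key: 2 | 4 means 24")
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Stems that have no leaves are simply omitted by this sketch, so a gap in the data would have to be shown by listing the empty stems explicitly.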
Selection of appropriate statistical techniques for data summarisation

Quantitative (ratio scale)
Descriptive statistics: Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 - Q1)
Graphical summary: Histogram, Bar Chart (bar representing means), Stem and leaf plot, Boxplot

Symmetrical Distribution
Descriptive statistics: Mean, Median, Mode, Range, Standard Deviation
Graphical summary: Histogram, Bar Chart (bar representing means)

Skewed Distribution
Descriptive statistics: Median, Range, Interquartile range (IQR = Q3 - Q1)
Graphical summary: Histogram, Stem and leaf plot, Boxplot

Categorical (Nominal)
Descriptive statistics: Mode, Counts, Percentage
Graphical summary: Pie Chart, Bar Chart

Categorical (Ordinal, Likert Scale)
Descriptive statistics: Mode, Mean, Counts, Percentage
Graphical summary: Pie Chart, Bar Chart
1.2.5 Presenting and Analysing the Data
 Analyse the information given by the
   descriptive statistics (refer topic 1.3), and
   graphical summary (graph and chart).
 Identify whether any relationship exists between the variables under study.
 Make any relevant statistical inferences:
   confidence interval, hypothesis testing, ANOVA, goodness of fit
  test, contingency table, regression, correlation, etc.
BASIC INFERENTIAL STATISTICS

Confidence Intervals (CHAPTER 2)
An estimated range of values which is likely to include an unknown population parameter, θ, with a specified probability (confidence level). The interval is usually written as (a, b) or a < θ < b.

Hypothesis Testing (CHAPTER 3)
Tests a statement (claim, conjecture or assertion) concerning a parameter or parameters of one or more populations.
• Statistical analysis for one population (mean, variance, proportion)
• Statistical analysis for two populations (mean, variance, proportion)

Analysis of Variance (ANOVA) (CHAPTER 4)
Statistical analysis for the means of three or more populations.
• One-way ANOVA
• Two-way ANOVA and Post Hoc Test

Linear Regression Analysis (CHAPTER 5)
A statistical technique that attempts to determine the strength of the relationship between a dependent variable (y) and independent variables (x).
• Simple linear regression analysis and correlation (y vs x)
• Multiple linear regression analysis and correlation (y vs xi)
• Model selection techniques to choose a parsimonious model that best fits the data

Statistical Analysis for Categorical Data (CHAPTER 6)
1. Tests concerning frequency distributions for categorical data (Goodness of Fit)
2. Tests concerning specific probability distributions (Goodness of Fit)
3. Tests of the independence of two variables (Contingency Table)
4. Tests of the homogeneity of proportions (Contingency Table)
ADVANCED INFERENTIAL STATISTICS

Experimental Design (DOE)
Planning, conducting, analysing and interpreting controlled tests to evaluate the factors that control the value of a parameter or group of parameters.
Example: ANOVA, Single factor experiment, Randomized Blocks, Latin Squares and Related Design, Factorial Design, Response Surface Methodology, Nested and Split-Plot Design

Time Series Analysis
Modelling, making inferences and producing forecasts of time series data for future observations. Time series models are built to represent serially correlated series, trends, or seasonal effects.
Example: Linear Time Series, Linear Stationary Models (AR, MA, ARMA), Linear Nonstationary Models (ARIMA, SARMA), Box-Jenkins Models, Volatility Models (ARCH, GARCH), Hybrid models

Multivariate Analysis
A central tool whenever many variables need to be considered at the same time.
Example: Mean Vector and Covariance Matrix Estimation, MANOVA, Principal Component Analysis, Factor Analysis, Canonical Correlation Analysis, Discriminant Analysis, Cluster Analysis

Statistical Quality Control (SQC)
Quality improvement through the use of modern statistical methods for quality control.
Example: Variables Control Charts, Attribute Control Charts, Time-Weighted Control Charts, Multivariate Control Charts
ADVANCED INFERENTIAL STATISTICS

Statistical Modelling
Mathematical equations that relate one or more random variables and possibly other non-random variables, concerning the generation of some sample data and similar data from a larger population.
• Examples of statistical models: Generalised Linear Model, Dependence model, Regression, Bayesian, Markov chain, Random effect and mixed model
• The process involves: parameter estimation, data generation, missing values, outlier detection, simulation study, bootstrap, goodness of fit test

Data Mining
A computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
Example: Decision Tables, Decision Trees, Classification Rules, Association Rules, Clustering, Advanced linear model, Bayesian, Instance-based Learning

Circular Statistics
A branch of statistics that involves circular data, which deal with direction or cyclic time. Circular data are measured in degrees (0°, 360°] or radians (0, 2π].
Example: orientation of an animal, direction of wind and wave, days of the week, compass direction, waves of sound, the human perception under various conditions, the orientation of ridges of fingerprints, the orientation of sand grains from a beach, the deaths due to a disease at various times in a year, and astronomical observations.
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Advanced Regression Analysis
• Polynomial Regression: y is modelled as an nth-degree polynomial in x.
• Multivariate Regression: Y is a matrix with a series of multivariate dependent
measurements and X is a matrix of observations on independent variables.
• Generalized Linear Model: A flexible generalization of ordinary linear
regression that allows for response variables that have error distribution
models other than a normal distribution.
• Logistic Regression: A regression model where the dependent variable is
categorical.
• Nonlinear Regression: The observational data are modelled by a function
which is a nonlinear combination of the model parameters and depends on
one or more independent variables.
• Error in Variables: A regression model that accounts for measurement errors
in the independent variables.
1.2.6 Make the decision and conclusion
 The researchers can make decisions in order to achieve the
objective and goal of the research, and choose the option
which represents the ‘best’ solution to the problem.
 The correctness of this choice depends on the analytical skill of
the researchers and the quality of the information.
1.3 REVIEWS ON
DESCRIPTIVE
STATISTICS
 Summarise the data using measures of central
tendency, such as the mean, median, mode, and
midrange.
 Describe the data using measures of variation, such
as the range, variance, standard deviation and
coefficient of variation.
 Identify the position of a data value in a data set
using measures of position such as quartiles, deciles,
and percentiles.

Reviews on Descriptive Statistics

 Descriptive statistics is typically used to confirm conjectures
about the data.
 We can summarise data using measures of central tendency,
measures of variation, and measures of position.
 Some classify these types of measures as traditional
statistics.
 If the measurement describes a population
characteristic, it is called a parameter.
 If the measurement describes a sample characteristic,
it is called a statistic.
RULE OF THUMB FOR DECIMAL
PLACES

1. In general, the calculated parameter or statistic value should


be rounded to four (4) decimal places.

2. If the unit is given (in cm, minute, day, etc.), the value should
be rounded to that unit’s decimal places.

TIPS: Descriptive Statistics using
Scientific Calculator
Casio fx-570MS

STEP 1: Insert data → MODE, SD, insert data, M+, AC


STEP 2: Data summary
Shift 1 →
Shift 2 →
STEP 3: Clear data → Shift CLR 1

Casio fx-570ES

STEP 1: Insert data → MODE 3: STAT, 1: 1-VAR, insert data, =, AC


STEP 2: Data summary:
Shift 1 → 3: Sum →
Shift 1 → 4: Var →
STEP 3: Clear data → Shift 9
Note:
The notations used in the calculator are $n$ for the sample size, $\bar{x}$ for the sample mean, $x\sigma_n$ or $\sigma_x$ for the population
standard deviation, and $x\sigma_{n-1}$ or $s_x$ for the sample standard deviation.
1.3.1 Measures of Central Tendency
 Measures of central tendency are also called measures of
average:
1. mean,
2. median,
3. mode, and
4. midrange.
 They can roughly describe the shape of the distribution of a data set.
 The measures of central tendency are used to describe an
entire set of observations with a single value representing the
central or middle value of the data set.
Midrange (MR)
Is a rough estimate of the middle.

$$MR = \frac{\text{lowest value} + \text{highest value}}{2}$$

EXAMPLE 1.3:
If the data set is 1, 3, 5, 7, 7, 8, then the calculated midrange is $MR = \frac{1 + 8}{2} = 4.5$.

Properties of Midrange
 A rough estimate of the average.
 Can be affected by one extremely high or low value (outlier).
Mean
Is the sum of the values divided by the total number of values.

Population Mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}, \quad N = \text{population size}$$

Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad n = \text{sample size}$$

If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data is taken from the population.
The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
Median
Is the middle number of n ordered data (smallest to largest).

If n is odd:
$$\text{Median (MD)} = x_{\frac{n+1}{2}}$$

If n is even:
$$\text{Median (MD)} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$

If the data set is 1, 3, 5, 6, 7, then the calculated median is $\text{Median} = x_3 = 5$.

If the data set is 1, 3, 5, 7, 7, 8, then the calculated median is $\text{Median} = \frac{x_3 + x_4}{2} = \frac{5 + 7}{2} = 6$.
Mode
Is the most commonly occurring value in a data series.

EXAMPLE 1.4:
a) If the data set is 1, 6, 3, 7, 8, 5, then the mode does not exist.
b) If the data set is 1, 6, 3, 7, 8, 3, 5, then the mode is 3.
c) If the data set is 1, 6, 3, 7, 3, 8, 7, 5, 3, 7, then the modes are 3 and 7.

Properties of Mode
 The mode is used when the most typical case is desired.
 The mode can be used when the data are nominal.
 The mode is not always unique.
 A data set can have more than one mode, or the mode may not
exist for a data set.
Identify the Shapes of Data
Distribution
Symmetric: Mean = Median = Mode
Positively skewed / right-skewed: Mean > Median > Mode
Negatively skewed / left-skewed: Mean < Median < Mode

→ In reality, median can be greater than mode or mean values.


→ The shape of the distribution may be identified by observing the
position of the mode value.
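The comparison rule above can be written as a small helper. This is only a sketch: the function name `distribution_shape` and the numbers passed to it are hypothetical, and it uses exact comparisons, so values that follow neither strict ordering are reported as undetermined.

```python
def distribution_shape(mean, median, mode):
    """Classify the shape of a distribution from its mean, median and mode."""
    if mean == median == mode:
        return "symmetric"
    if mean > median > mode:
        return "positively skewed (right-skewed)"
    if mean < median < mode:
        return "negatively skewed (left-skewed)"
    return "undetermined"

# Hypothetical summary values, used only for illustration.
print(distribution_shape(50, 50, 50))
print(distribution_shape(60, 55, 50))
print(distribution_shape(40, 45, 50))
```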
EXAMPLE 1.3
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data is taken from the
population. The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
‒ the calculated median is $\text{Median} = \frac{x_3 + x_4}{2} = 6$.
‒ the mode is 7.
‒ the shape of distribution is negatively skewed since
Mean < Median < Mode.
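The values in Example 1.3 can be checked with Python's standard `statistics` module instead of a calculator. This is a minimal sketch; the variable names are chosen only for illustration.

```python
import statistics

data = [1, 3, 5, 7, 7, 8]

mean = statistics.mean(data)              # sum of the values / number of values
median = statistics.median(data)          # average of the two middle values when n is even
modes = statistics.multimode(data)        # list of the most frequent value(s)
midrange = (min(data) + max(data)) / 2    # (lowest value + highest value) / 2

print(f"mean = {mean:.4f}, median = {median}, mode(s) = {modes}, midrange = {midrange}")
```

Note that `multimode` returns every value when no value repeats, whereas the slides treat that case as "the mode does not exist".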

Properties of Mean and Median
 The mean is unique, and not necessarily one of the data values.
 The mean is affected by extremely high or low values and if it occurs, the
mean may not be the appropriate average to use.
 The mean is used in computing other statistics, such as variance.
 The mean cannot be computed for an open ended frequency distribution.
 The mean varies less than the median or mode when samples are taken from
the same population and all three measures are computed for these samples.
 The mean is not an appropriate average to use if the shape of distribution is
skewed.
 The median is used when one must find the center or middle value of a data
set.
 The median is used when one must determine whether the data values fall into the upper half or
lower half of the distribution.
 The median is affected less than the mean by extremely high or extremely low
values.

EXAMPLE 1.5
An extreme value, say 21, is added to the data set in Example 1.3. The new
data set is 1, 3, 5, 7, 7, 8, 21. Assume that the data is taken from a sample, then
‒ the calculated mean is $\bar{x} = 7.4286$. The mean is easily affected by
outliers and may not be the appropriate average to use. This new average
value no longer represents the centre of the data set.
‒ the calculated median is $\text{Median} = x_4 = 7$. This new average value
still represents the centre of the data set.
‒ the mode is 7.
‒ the calculated midrange is $MR = \frac{1 + 21}{2} = 11$. The midrange is easily
affected by outliers.
‒ the shape of distribution is positively skewed since the mode is the smallest
value as compared with the mean and median values.

An extremely high or low data value that occurs in a data set is called an outlier.
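The effect of the outlier in Example 1.5 can be seen by computing the averages before and after the value 21 is added. The sketch below uses the standard `statistics` module; the helper name `averages` is an assumption made for this illustration.

```python
import statistics

def averages(values):
    """Return the mean, median and midrange of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "midrange": (min(values) + max(values)) / 2,
    }

original = [1, 3, 5, 7, 7, 8]
with_outlier = original + [21]

before = averages(original)
after = averages(with_outlier)
for name in before:
    print(f"{name:>8}: {before[name]:.4f} -> {after[name]:.4f}")
```

The median moves by one unit, while the mean and especially the midrange shift much further towards the outlier.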
EXERCISE 1.3.1
1. Determine the shape of distribution of the following
data.

a) Mean = Mode = Median = 11


b) Mean = 25, Mode = 13, Median = 17
c) Mean = 5, Mode = 73, Median = 17
d) 11.4, 11.6,12.6,12.7, 12.8, 13.3, 13.3, 13.6, 13.7,
13.8

a) symmetric b) right-skewed c) left-skewed d) Mean = 12.88, Median = 13.05, mode = 13.3, left-skewed

EXERCISE 1.3.1
2. The following set of data represents the number of hospitals
for selected countries.

123 108 195 138 115 179 119 148 147 180
146 178 189 108 193 114 179 147 108 128
164 174 128 159 193 175

a) Find the mean, median, mode, and midrange.


b) Are the average values calculated in (a) parameters or
statistics? Why?
c) What is the distribution type that describes the data?
d) What is the best measure of average of this set of data?
Why?

a) Mean = 151.3462, Median = 147.5, mode = 108, midrange = 151.5 b) statistic c) right-skewed d) median

1.3.2 Measures of Variation/Dispersion
 Measures of variation or measures of dispersion are measures
that determine the spread of data values.
1. Range: the simplest measure of variation.
2. Variance, and
3. Standard deviation: more meaningful and popular measures that describe the
variability of data.
4. Coefficient of Variation.

 Measures of variation may help researchers to describe data
more accurately.
 Variance and standard deviation are used quite often in
inferential statistics.
Range (R)
Is the difference between the highest value and the lowest value in a
data set.

R = highest value - lowest value

EXAMPLE 1.6:
Suppose the data set is 1, 6, 3, 7, 8, 5, then the calculated range is $R = 8 - 1 = 7$.

Properties of Range
 The simplest measure of variation.
 Easily affected by one extremely high or low value (outliers).
Variance
Is the average of the squares of the distance each value is from the mean.

Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \quad N = \text{population size}$$

Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \quad n = \text{sample size}$$

Standard Deviation
Is the square root of the variance.

Population standard deviation, $\sigma$
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}, \quad N = \text{population size}$$

Sample standard deviation, $s$
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}, \quad n = \text{sample size}$$
Properties of Variance & Standard Deviation

 The variance is the average of the squares of the distance each value
is from the mean.
 If the data values are near the mean, the variance will be smaller.
 If the data values are far from the mean, the variance will be larger.
 The squared distance is used since the sum of the distances (deviations)
from the mean will always be zero.
 Variance is never negative.
 The unit of the variance is the square of the unit of the data.
 Standard deviation is the square root of the variance.
 Standard deviation is a measure of the deviations of values from the
mean.
 Standard deviation is never negative.
 The unit of the standard deviation is the same as the unit of the data.
Coefficient of Variation
Is the standard deviation divided by the mean.

Population CVar
$$CVar = \frac{\sigma}{\mu} \times 100\%$$

Sample CVar
$$CVar = \frac{s}{\bar{x}} \times 100\%$$

Properties of CVar
 The result is expressed as a percentage.
 A parameter/statistic that allows the user to compare the standard deviations
when the units are different (the variables are different).
EXAMPLE 1.6
Suppose the data set is 1, 6, 3, 7, 8, 5, then
‒ the calculated range is $R = 8 - 1 = 7$.
‒ the calculated variance is $\sigma^2 = 5.6667$ and the standard deviation is $\sigma = 2.3805$
if the data is taken from the population. These values are called parameters.
‒ the calculated variance is $s^2 = 6.8$ and the standard deviation is $s = 2.6077$ if the
data is taken from the sample. These values are called statistics.
‒ the calculated sample mean is $\bar{x} = 5$. Hence the sample coefficient of variation
is $CVar = \frac{2.6077}{5} \times 100\% = 52.15\%$.
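The figures in Example 1.6 can be reproduced with the standard `statistics` module, which keeps the population formulas (dividing by N) and the sample formulas (dividing by n - 1) in separate functions. This is a minimal sketch; the variable names are chosen only for illustration.

```python
import statistics

data = [1, 6, 3, 7, 8, 5]

data_range = max(data) - min(data)        # highest value - lowest value
pop_var = statistics.pvariance(data)      # population variance, divides by N
pop_sd = statistics.pstdev(data)          # population standard deviation
samp_var = statistics.variance(data)      # sample variance, divides by n - 1
samp_sd = statistics.stdev(data)          # sample standard deviation
cvar = samp_sd / statistics.mean(data) * 100   # sample coefficient of variation in %

print(f"R = {data_range}")
print(f"population: variance = {pop_var:.4f}, sd = {pop_sd:.4f}")
print(f"sample:     variance = {samp_var:.4f}, sd = {samp_sd:.4f}, CVar = {cvar:.2f}%")
```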

Why We Need Measures of Variation
• Measures of variation help to judge how well the
measures of average illustrate or depict the data.
• They are so called because they measure the
variability that exists in a data set.
• They can be used when the measures of central tendency do not give
any significant meaning or are not needed/practical.

EXAMPLE:
Suppose we wish to compare the performance of two groups of students
in a test, given that the mean values are the same for both data sets.
In short, you might conclude that these two groups of students
performed equally well in the test. However, if the data sets are
examined graphically as shown in Figure 1.10, a different conclusion
might be drawn.
Examining Data Sets Graphically

 Both groups have the same total number of students.
 Students are given the same test and the mean score is
calculated as 66.67 marks for each group of students.
 The mean values are the same but the spread or variation of the
test scores is quite different.
 The test scores for students from Group B are more consistent and
less variable.
 When the mean values are equal, the larger the data range is, the
more variable the data.
Comparing Two Data Sets
A smaller standard deviation ($\sigma_1 < \sigma_2$) indicates that:

POPULATION 1 is                     POPULATION 2 is
 Less dispersed                     More dispersed
 Less spread                        More spread
 Less variable (small variation)    More variable (large variation)
 More consistent                    Less consistent
 More precise                       Less precise
 More accurate                      Less accurate
 Better data                        Worse data

The same interpretation is applicable for range and variances.
EXAMPLE 1.7
The following data represents the age (in years) of lecturers in two faculties at UMP.

FIST: 24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45
FKEE: 22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53

For these sample data sets, find the standard deviations. Then, identify which data set
is more consistent and less dispersed. What can you say about the variation of age for
lecturers in both faculties?

Solution:

$s_{FIST} = 7.4670$ years
$s_{FKEE} = 9.9460$ years
$s_{FIST} < s_{FKEE}$, so the FIST data is more consistent and less dispersed.
The variation of ages for lecturers in FIST is small and less dispersed as
compared to FKEE lecturers.
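The two sample standard deviations in Example 1.7 can be verified as follows. This is a short sketch using the standard `statistics` module; the variable names are illustrative.

```python
import statistics

fist = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]
fkee = [22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53]

s_fist = statistics.stdev(fist)   # sample standard deviation, divides by n - 1
s_fkee = statistics.stdev(fkee)

# The data set with the smaller standard deviation is the more consistent one.
more_consistent = "FIST" if s_fist < s_fkee else "FKEE"
print(f"s_FIST = {s_fist:.4f} years, s_FKEE = {s_fkee:.4f} years")
print(f"More consistent: {more_consistent}")
```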
EXERCISE 1.3.2 (Q1&Q2)
1. Which of the following sets of sample data is less variable?

Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
$s_A = 3.6742 < s_B = 7.8493$

2. The following sets of sample data represent the battery
lifetime (in hours) from two different brands. Which brand of
battery performs better?

A: 4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3
B: 9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1
$s_A = 1.5$ hours $> s_B = 0.6$ hours
Comparing Two Data Sets with different units/variables
 If the two samples do not have the same units of measurement or the
variables are different, the variance and standard deviation for each
sample cannot be compared directly.

 As an example: suppose a car dealer wants to compare the variation
between the number of car sales for a year and the commission (in
RM) made by the salesperson. It is very clear that these two
variables have two different units.

 Hence, the best way to compare the variability within these two
variables is by using the coefficient of variation.

 This means that if $CVar_1 < CVar_2$, then variable one is less
variable than variable two.
EXERCISE 1.3.2 (Q3)
3. The average age of the accountants at a huge company is 31
years with a standard deviation of 4 years. The average
salary of the accountants is RM 44255 per year with a
standard deviation of RM 780. Compare the variations of
age and income.

$CVar_{age} = \frac{4}{31} \times 100\% = 12.90\% > CVar_{income} = \frac{780}{44255} \times 100\% = 1.76\%$
Other Properties of Standard Deviation
 Used to determine the number of data values that fall within a
specified interval in a distribution.

 The values under the curve indicate the percentage of area in each
section or range of data.
 It can be seen that about 95% of data values fall within 𝜇 − 2𝜎
and 𝜇 + 2𝜎.
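The "about 95%" figure can be checked numerically if we assume the curve in question is a normal distribution (an assumption here, since the slide shows only the curve). The sketch below uses `statistics.NormalDist` from the standard library; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The result does not depend on the particular values of the mean and standard deviation, only on the number of standard deviations k.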
1.3.2.1 Accuracy and Precision Concept (Validity and Reliability)
→ The concept is important to ensure that data collected from an
experiment or observation is good, valid, and reliable.

Accuracy
 Accuracy is how close a measured value is to the ‘true’ measurement.
 No measurement/device is perfect (it can easily be inaccurate and lead to false measurements). There is still a tolerance for error.
 Accuracy must be accounted for in your results.
 The bigger the difference between the measured and the true values, the less accurate (less valid) the measurement.

Precision
 Precision is how close the measured values are to each other, or how consistent your results are for the same phenomenon over several measurements.
 Precision as a measure of variation must be accounted for in your calculations and results.
 The precision of a measurement is the size of the unit used to make a measurement. The smaller the unit, the more precise (more reliable) the measurement.
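One way to put numbers on these two ideas, offered as a sketch rather than a standard definition, is to use the difference between the mean reading and the true value as a measure of accuracy and the sample standard deviation as a measure of precision. The function name and both sets of readings below are hypothetical.

```python
import statistics

def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread): mean minus true value, and sample standard deviation."""
    bias = statistics.mean(measurements) - true_value   # close to 0 means accurate
    spread = statistics.stdev(measurements)             # small means precise
    return bias, spread

# Hypothetical readings of a quantity whose true value is 50.
scattered = [48, 52, 50, 49, 51]    # centred on the true value but spread out
clustered = [60, 60, 61, 60, 59]    # tightly grouped but far from the true value

bias_a, spread_a = accuracy_and_precision(scattered, 50)
bias_b, spread_b = accuracy_and_precision(clustered, 50)
print(f"scattered: bias = {bias_a}, spread = {spread_a:.4f}")
print(f"clustered: bias = {bias_b}, spread = {spread_b:.4f}")
```

In this illustration the first set is accurate but less precise, while the second is precise but not accurate.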
Game of Darts

[Figure: four dartboards, described from left to right]

Accuracy without precision
• Very accurate (close to the mark) measurements, but not very precise, since the darts are spread out everywhere.
• Valid but not reliable.

Precision without accuracy
• Very consistent, but not near the mark.
• Not valid but reliable.

Inaccuracy and imprecision
• Not valid and not reliable.

Accurate and precise
• Valid and reliable.
• Very good measurement.
EXERCISE 1.3.2 (Q4)
4. Identify each situation as either accurate or precise or both.

a) If you are playing football and you always hit the left goal post
instead of scoring.
b) A candy manufacturer claims that each packet contains 20 candies.
A sample of packet have 18, 21, 19, 21, 19, 20, 22 candies,
respectively. The average is 20 candies with an error of 1 candy.
c) A manufacturer claims that each chocolate packet contains 20
chocolates. A sample of packets have 17, 18, 18, 17, 18, 17, 17
chocolates, respectively.
d) In an experiment, with five trials, the end results of the five trials for
whatever is being tested are: 35 kg, 36 kg, 36 kg, 35 kg, 36 kg. The
actual value (as found in a scientific data book) is meant to be 42 kg.
e) In an experiment, with five trials, the average value is 35 kg. The
actual value (as found in a scientific data book) is meant to be 35 kg.
MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures
of the “centre” of a data set?

5. Which do you think has more variation: the IQ scores of 30 students


in a statistics class or the IQ scores of 30 teenagers watching a
movie? Why?

6. Explain why median and interquartile range are more appropriate


measures as compared to mean and variance for non-normal data.

7. A JDT football fan records the number on the jersey of each player
in a game. Does it make sense to calculate the mean of those
numbers? Why or why not?

MIND EXPANDING EXERCISES
8. In an analysis of the accuracy of weather forecasts, the actual high
temperatures are compared to the high temperatures predicted one day earlier
and the temperatures predicted five days earlier. Listed below are the errors
between the predicted temperatures and the actual high temperatures for 14
consecutive days in Kuala Lumpur.

(Actual high) ‒ (High predicted one day earlier):
2 2 0 0 ‒3 ‒2 1 ‒2 8 1 0 ‒1 0 1
(Actual high) ‒ (High predicted five days earlier):
0 ‒3 2 5 ‒6 ‒9 4 ‒1 6 ‒2 ‒2 ‒1 6 ‒4

a) Do the means and medians of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
b) Do the standard deviations of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?

ME.8 (solution)

Mean median sd
1.5000 1.0000 2.4152
3.8333 4.5000 2.4014

MIND EXPANDING EXERCISES
9. A data set consists of 20 values that are fairly close together. Another
value is included, but this new value is an outlier (very far away from
the other values). How is the standard deviation affected by the
outlier? No effect? A small effect? Or a large effect?

10. Suppose scores on a psychological test have a mean of 90 and a standard
deviation of 10. Meanwhile, scores on an economics test have a mean
of 55 and a standard deviation of 5. Which is relatively better: a score
of 85 on the psychological test or a score of 45 on the economics test?

11. When designing the production procedure for batteries used in heart
pacemakers, an engineer specifies that “the batteries must have a
mean life greater than 10 years, and the standard deviation of the
battery life can be ignored.” If the mean battery life is greater than 10
years, can the standard deviation be ignored? Why or why not?

1.3.3 Measures of Position
Describe where a specific data value falls within the data set or its
relative position based on percentiles, deciles and quartiles in
comparison with other data values
Describing the position of
the data value
(increasing order)

Percentiles Deciles Quartiles


Split data into Split data into Split data into
100 equal parts 10 equal parts 4 equal parts

Pi  x in  xc Di  x in  xc Qi  xin  xc
100 10 4
SZS2017
Pi  x in  xc Di  x in  xc Qi  xin  xc
100 10 4

 If c is not a whole number, round it up to the next whole number.


xc  xc 1 xc  xc 1 xc  xc 1
 If c is a whole number, then use Qi  , Di  , Pi 
2 2 2
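
The position rule above can be sketched in Python. This is only an illustration under stated assumptions: the function name `position_value` and its `parts` argument are hypothetical (not from the slides), and the function assumes 1 <= i < parts so that the value after position c always exists.

```python
def position_value(data, i, parts):
    """Value at the i-th percentile (parts=100), decile (parts=10) or quartile (parts=4).

    Rule from this section: c = i*n/parts. If c is not a whole number, round
    it up; if c is a whole number, average x_c and x_(c+1).
    Assumes 1 <= i < parts (hypothetical helper, not part of the slides).
    """
    xs = sorted(data)                              # positions refer to ordered data
    whole, remainder = divmod(i * len(xs), parts)  # integer arithmetic avoids float error
    if remainder == 0:                             # c is a whole number
        return (xs[whole - 1] + xs[whole]) / 2     # average of x_c and x_(c+1)
    return xs[whole]                               # c rounded up (list index is 0-based)


volumes = [40, 45, 38, 25, 42, 31, 30, 44, 26, 27, 36]   # data used in Example 1.9
print(position_value(volumes, 1, 4), position_value(volumes, 25, 100))
print(position_value(volumes, 3, 10), position_value(volumes, 30, 100))
```

Each printed pair should agree, which is the equivalence between quartiles, deciles and percentiles that Example 1.9 asks for.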

EXAMPLE 1.9
A manufacturer measured the volume of a sample of 11 bottles of chemical
solvents. The results are recorded (in millilitres) as follows.
40 45 38 25 42 31 30 44 26 27 36
Show that $Q_1$ is equivalent to $P_{25}$, $Q_2$ is equivalent to $P_{50}$, $Q_3$ is
equivalent to $P_{75}$, and $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, \ldots, 9$.

The dataset in increasing (ascending) order: 25 26 27 30 31 36 38 40 42 44 45

Quartiles                                          Percentiles
$Q_1 = x_{1(11)/4} = x_{2.75} \to x_3 = 27$        $P_{25} = x_{25(11)/100} = x_{2.75} \to x_3 = 27$
$Q_2 = x_{2(11)/4} = x_{5.50} \to x_6 = 36$        $P_{50} = x_{50(11)/100} = x_{5.50} \to x_6 = 36$
$Q_3 = x_{3(11)/4} = x_{8.25} \to x_9 = 42$        $P_{75} = x_{75(11)/100} = x_{8.25} \to x_9 = 42$

Summary: $Q_1$ is equivalent to $P_{25}$; $Q_2$ is equivalent to $P_{50}$; $Q_3$ is equivalent to $P_{75}$.

Deciles                                            Percentiles
$D_3 = x_{3(11)/10} = x_{3.3} \to x_4 = 30$        $P_{30} = x_{30(11)/100} = x_{3.3} \to x_4 = 30$
$D_5 = x_{5(11)/10} = x_{5.5} \to x_6 = 36$        $P_{50} = x_{50(11)/100} = x_{5.5} \to x_6 = 36$
$D_7 = x_{7(11)/10} = x_{7.7} \to x_8 = 40$        $P_{70} = x_{70(11)/100} = x_{7.7} \to x_8 = 40$

Summary: $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, 3, 4, 5, 6, 7, 8, 9$.

EXERCISE 1.3.3
1. Given a set of data: 9 2 1 4 3 7 5 4 6.

a) Find the value corresponding to the 4th decile.
b) Find the value corresponding to the 3rd quartile.

2. A teacher gives a 25-point test to ten students. The scores are shown
below.
9 22 11 14 13 3 7 15 18 16

a) Find the score corresponding to the 20th percentile.
b) Find the score corresponding to the 7th decile.

Answers: 1) a) 4, b) 6   2) a) 8, b) 15.5

Why We Need Measures of Position?
 Percentiles are a measure of position often used in education and
health-related fields to indicate the position of an individual in a group.
 A percentile is not a percentage value. The ith percentile, $P_i$, is a
value such that i% of the data are less than or equal to $P_i$ and
(100 − i)% are greater than or equal to $P_i$.

EXAMPLE:
If a student obtains 82 marks out of 100 in a test, his/her score is 82%.
However, this gives no indication of his/her position with respect to the
rest of the class. On the other hand, if his/her score corresponds to the
75th percentile, then he/she scored as high as or higher than 75% of the
students in his/her class.

Why We Need Measures of Position?
Quartiles can be used as a rough measurement of variability.

INTERQUARTILE RANGE (IQR)

 defined as the difference between Q3 and Q1 (IQR = Q3 − Q1); it is the
range of the middle 50% of the data.
 used to identify outliers, and to measure variability in
exploratory data analysis (Section 1.4).
 the smaller the value of the IQR, the smaller the variation in the
data.
 useful for describing the variability of a data set: a larger IQR means
the data are more variable, more dispersed and more spread out; a smaller
IQR means the data are more consistent.
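
As a worked illustration, using the quartiles obtained for the 11 bottle volumes in Example 1.9 ($Q_1 = 27$, $Q_3 = 42$), the interquartile range is:

```latex
\mathrm{IQR} = Q_3 - Q_1 = 42 - 27 = 15 \text{ millilitres}
```

That is, the middle 50% of the recorded volumes span 15 millilitres.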

MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures of
the “centre” of a data set?

5. Which do you think has more variation: the IQ scores of 30 students
in a statistics class or the IQ scores of 30 teenagers watching a
movie? Why?

6. Explain why the median and interquartile range are more appropriate
measures than the mean and variance for non-normal data.

7. A JDT football fan records the number on the jersey of each player
in a game. Does it make sense to calculate the mean of those
numbers? Why or why not?

1.3.4 Descriptive Statistics Using Microsoft Excel

Interpreting Descriptive Statistics
Using Microsoft Excel (Example 1.9)
A firm is conducting a study to compare two different physical
arrangements of its assembly line. The arrangement with the smaller
variance in the number of finished units produced per day will be adopted
as the new arrangement of its assembly line.

→ $\bar{x}_1 < \bar{x}_2$: on average, Assembly Line 2 produced more
finished units per day.

→ $s_1 < s_2$, $R_1 < R_2$ and $(\text{s.e.})_1 < (\text{s.e.})_2$. The arrangement
of Assembly Line 1 is more consistent, less dispersed,
less spread, less variable (smaller variation), and more
precise. Therefore the arrangement of Assembly
Line 1 will be adopted as the new arrangement.

→ For Assembly Line 1, the distribution of data is
negatively skewed (left-skewed) since
Mean < Median < Mode. The skewness value is
also negative.

→ For Assembly Line 2, the distribution of data is also
negatively skewed (left-skewed) since the mode is
higher than both the mean and the median. The
skewness value is also negative.

→ The skewness value for Assembly Line 2 is higher
than that for Assembly Line 1. Hence the distribution of
data from Assembly Line 2 is more skewed to the
left, indicating that Assembly Line 2 produced more
finished units per day.

→ For Assembly Line 1,
$\bar{x}_1 \pm \text{Confidence Level} = 491.1 \pm 17.1 = (474.0, 508.2)$.
Hence, we are 95% confident that the population
mean number of finished units per day for Assembly
Line 1 lies between 474.0 and 508.2 units.

→ For Assembly Line 2,
$\bar{x}_2 \pm \text{Confidence Level} = 499.4 \pm 25.2 = (474.2, 524.6)$.
Hence, we are 95% confident that the population
mean number of finished units per day for Assembly
Line 2 lies between 474.2 and 524.6 units.
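
The interval arithmetic can be checked with a short Python sketch. Excel's Descriptive Statistics output labels the half-width of the interval "Confidence Level(95.0%)"; the helper name `interval_from_margin` is hypothetical and only reproduces mean ± half-width for the two values quoted in this example.

```python
def interval_from_margin(mean, margin, digits=1):
    """Return (lower, upper) = (mean - margin, mean + margin), rounded for display."""
    return (round(mean - margin, digits), round(mean + margin, digits))


# Sample means and half-widths quoted for the two assembly lines
line_1 = interval_from_margin(491.1, 17.1)
line_2 = interval_from_margin(499.4, 25.2)
print("Assembly Line 1:", line_1)
print("Assembly Line 2:", line_2)
```

The wider interval for Assembly Line 2 reflects its larger standard error.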

MIND EXPANDING EXERCISES
12. A lecturer is interested in investigating students’ performance in a
statistics course based on their carry marks and their scores in
the final examination. The descriptive statistics and graph are
given below. From the analyses, comment on the students’
performance based on carry marks and final examination scores.

[ME.12: descriptive statistics and graph not reproduced.]

13. A study is conducted to compare the performance of male and female
students in the statistics course based on final examination scores. The
data, descriptive statistics and graph of the final examination scores
are presented as follows. Based on the analysis, answer the following
questions:

Female: 72 62 83 65 60 74 66 68 57 63 61
        76 60 78 34 70 59 63 86 43 90 87
Male:   58 81 86 68 70 77 54 54 72 41 33 52
        70 37 67 39 74 32 8 33 27 23 54

[ME.13: descriptive statistics and graph not reproduced.]

a) State the mean and standard deviation for both groups and give your
comment.
b) Based on the graph shown, give your comment.

MIND EXPANDING EXERCISES
14. People with diabetes must monitor and control their blood glucose level. The
goal is to maintain fasting plasma glucose between 90 and 130 mg/dl. The
data presented below give the fasting plasma glucose for two groups, before
treatment and after treatment. Answer the following questions:

a) How many data values are there in each group?
b) Give the first five data values in the ‘before’ group and the last five
data values in the ‘after’ group.
c) Identify the median and mode in each group.
d) Describe the shape of the distribution of data in each group.
e) Is there any outlier in the groups?
f) What are the advantages of using a stem and leaf plot?
g) Which data set is more dispersed, and which is more consistent?
h) Based on the descriptive analysis done in Excel, why do you think the
comparison of dispersion between the two groups based on the variance
differs from the comparison based on the IQR?

MIND EXPANDING EXERCISES
Before After
ME.14 8 7
8
6 5 9
3 10
2 11
12 8 8
4 13
7 5 8 1 14
3 8 15 8 9
16 3 4 0
2 2 17
18 8
19 5 8
0 20
21
22 7 6 3 1 0
23
24
5 25
26
1 27
28 3
29
30
31
32
33
34
9 35
Key: 14|1=141

1.4 EXPLORATORY DATA ANALYSIS

 Identify outliers.
 Draw and interpret a boxplot.

Exploratory Data Analysis

Traditional Method        Exploratory Data Analysis
Frequency distribution    Stem and leaf plot
Histogram                 Boxplot
Mean                      Median
Standard deviation        Interquartile range (IQR = Q3 − Q1)

 The purpose of exploratory data analysis is to discover any gaps or
patterns in the data.
 For symmetric data, the appropriate measure of central tendency
is the mean and the appropriate measure of variability is the standard
deviation or variance.
 For skewed data, the appropriate measure of central tendency is the
median and the appropriate measure of variability is the interquartile
range (IQR).

RECALL: Selection of appropriate statistical techniques for data
summarisation

Type of Data: Quantitative (ratio scale)
   Descriptive Statistics: Mean, Median, Mode, Range, Standard Deviation,
   Interquartile range (IQR = Q3 − Q1)
   Graphical Summary: Histogram, Bar Chart (bar representing means),
   Stem and leaf plot, Boxplot

Type of Data: Symmetrical Distribution
   Descriptive Statistics: Mean, Median, Mode, Range, Standard Deviation
   Graphical Summary: Histogram, Bar Chart (bar representing means)

Type of Data: Skewed Distribution
   Descriptive Statistics: Median, Range, Interquartile range (IQR = Q3 − Q1)
   Graphical Summary: Histogram, Stem and leaf plot, Boxplot

Type of Data: Categorical (Nominal)
   Descriptive Statistics: Mode, Counts, Percentage
   Graphical Summary: Pie Chart, Bar Chart

Type of Data: Categorical (Ordinal, Likert Scale)
   Descriptive Statistics: Mode, Mean, Counts, Percentage
   Graphical Summary: Pie Chart, Bar Chart

Histogram, Stem and Leaf OR Boxplot?

Histogram
Advantages:
‒ Can graph huge data sets easily.
‒ The shape of the distribution can be easily described.
‒ You could change the intervals of the histogram to see which gives a
better description of the data.
‒ Great for comparing data.
‒ Can show trends in the data clearly.
Disadvantages:
‒ Not good for small data sets.
‒ It is difficult to simplify all the data into one scale.

Stem and Leaf
Advantages:
‒ Very easy to construct.
‒ Shows the real values of the data.
‒ Can show the range, minimum and maximum, gaps and clusters, and
outliers easily.
‒ May show the mode.
‒ Can identify the shape of the distribution.
Disadvantages:
‒ Not good for small data sets or very large data sets.
‒ Not visually appealing.
‒ Does not easily indicate measures of centrality for large data sets.

Boxplot
Advantages:
‒ Good for small or large data sets.
‒ Displays the range and distribution of data along a number line.
‒ Can show outliers.
Disadvantages:
‒ The original data are not clearly shown in the boxplot.
‒ The mean and mode cannot be identified in a boxplot.

1.4.1 Outliers
 An outlier is an extremely high or an extremely low data value when
compared with the rest of the data values.
 Outliers can arise from:
 a measurement or observational error,
 a writing or typing error,
 a data value obtained from a subject that is not in the defined
population, or
 a legitimate data value that occurred by chance.
 When a distribution is symmetric or normal, data values that are
more than three standard deviations from the mean can be considered
suspected outliers (refer to Figure 1.11).
 An outlier can strongly affect the mean and standard deviation of a
variable.
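
A small Python sketch, using only the standard `statistics` module, illustrates the last point. It takes the data set 1, 2, 3, 4, 4, 5, 6, 7, 19 (the sorted data of Exercise 1.4.1, question 1) and compares the summaries with and without the value 19; the variable names are illustrative only.

```python
from statistics import mean, median, stdev

with_outlier = [1, 2, 3, 4, 4, 5, 6, 7, 19]       # sorted data from Exercise 1.4.1 (1)
without_outlier = [x for x in with_outlier if x != 19]

for label, values in (("with 19", with_outlier), ("without 19", without_outlier)):
    # stdev() is the sample standard deviation (divisor n - 1)
    print(f"{label:>10}: mean={mean(values):.2f}  "
          f"sd={stdev(values):.2f}  median={median(values)}")
```

The mean and standard deviation change noticeably when the single extreme value is dropped, while the median stays at 4, which is why the median and IQR are preferred for skewed data.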

Recall: Other Properties of Standard Deviation
 Used to determine the number of data values that fall within a
specified interval in a distribution.

 The values under the curve indicate the percentage of area in each
section or range of data.
 It can be seen that about 95% of data values fall within 𝜇 − 2𝜎
and 𝜇 + 2𝜎.
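
For a normal distribution, these percentages can be reproduced with Python's standard `statistics.NormalDist`. The sketch below is an illustration only; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The value for k = 2 is close to 0.95, matching the statement above, and the value for k = 3 shows why points beyond three standard deviations are treated as suspected outliers.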

Position of Outliers
A data value x is an outlier if it is less than the lower boundary value or
exceeds the upper boundary value for the data set, where
lower boundary $= Q_1 - 1.5(Q_3 - Q_1)$ and upper boundary $= Q_3 + 1.5(Q_3 - Q_1)$.

EXAMPLE 1.11
The number of credits in business courses for eight job applicants is
shown here:
9, 12, 15, 27, 33, 45, 63, 72.
Find the first and third quartiles for the above data. Is there any
outlier in the above data?

$Q_1 = x_{1(8)/4} = x_2 \;\Rightarrow\; Q_1 = \dfrac{x_2 + x_3}{2} = \dfrac{12 + 15}{2} = 13.5$

$Q_3 = x_{3(8)/4} = x_6 \;\Rightarrow\; Q_3 = \dfrac{x_6 + x_7}{2} = \dfrac{45 + 63}{2} = 54$

lower boundary: $Q_1 - 1.5(Q_3 - Q_1) = 13.5 - 1.5(54 - 13.5) = -47.25$

upper boundary: $Q_3 + 1.5(Q_3 - Q_1) = 54 + 1.5(54 - 13.5) = 114.75$

→ Since every data value x satisfies $-47.25 < x < 114.75$, there is no outlier.
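
The boundary check in this example can be sketched in Python as follows. The function names `outlier_boundaries` and `find_outliers` are hypothetical; the quartiles are passed in as already computed by the position rule of Section 1.3.3.

```python
def outlier_boundaries(q1, q3):
    """Return (lower, upper) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """List the data values lying outside the two boundaries."""
    lower, upper = outlier_boundaries(q1, q3)
    return [x for x in data if x < lower or x > upper]


credits = [9, 12, 15, 27, 33, 45, 63, 72]        # data from Example 1.11
print(outlier_boundaries(13.5, 54))              # (-47.25, 114.75)
print(find_outliers(credits, 13.5, 54))          # [] : no outlier
```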


EXERCISE 1.4.1
1. Given 19 2 1 4 3 7 5 4 6. Find the outliers, if any.

Answer: $Q_1 = 3$, $Q_3 = 6$; 19 is an outlier.

2. Given 19 6 2 11 4 3 7 7 5 8 6 21 12. Find the outliers, if any.

Answer: $Q_1 = 5$, $Q_3 = 11$; 21 is an outlier.

1.4.2 Boxplots
A boxplot (box-and-whisker plot) is a graphical representation of the
five-number summary of a data set, together with any outliers. The
five-number summary consists of:
 The lowest value of the data set (minimum)
 The lower quartile Q1 (1st quartile or 25th percentile)
 The median (2nd quartile or 50th percentile)
 The upper quartile Q3 (3rd quartile or 75th percentile)
 The highest value of the data set (maximum)
Outliers, if any, are shown in addition to the five-number summary.

Types of Boxplots
[Figures: a horizontal boxplot and a vertical boxplot.]

EXAMPLE 1.12
The following mixture stem and leaf plot represents a sample of the ages of
teachers in two schools.

     School A     | Stem | School B
  9 7 7 5 5 4     |  2   | 2
8 7 6 2 1 1 0     |  3   | 3 4 6 7
                  |  4   | 0 1 3 4 5 7
            7     |  5   | 1 3 4
[key: 3|4 → 34]

Given that for School B, $Q_1 = 36$, $Q_2 = 42$, $Q_3 = 47$ and there is no
outlier. Draw boxplots for both schools on the same x-axis. Then compare the
shapes, averages, and variability of both age distributions.

                    School A                                             School B
Minimum             24                                                   22
1st quartile        $Q_1 = x_{1(14)/4} = x_{3.5} \to x_4 = 27$           $Q_1 = 36$
2nd quartile/       $Q_2 = \dfrac{x_7 + x_8}{2} = 30.5$                  $Q_2 = 42$
Median
3rd quartile        $Q_3 = x_{3(14)/4} = x_{10.5} \to x_{11} = 36$       $Q_3 = 47$
Maximum             38                                                   54
Outliers            $Q_1 - 1.5(Q_3 - Q_1) = 27 - 1.5(36 - 27) = 13.5$    no outlier
                    $Q_3 + 1.5(Q_3 - Q_1) = 36 + 1.5(36 - 27) = 49.5$
                    Since 57 > 49.5, 57 is an outlier.
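
The table can be reproduced with a short Python sketch. The ages below are read off the stem and leaf plot, and the function names `quartile` and `boxplot_summary` are hypothetical; the minimum and maximum are taken over the values that are not outliers, as in the table.

```python
def quartile(xs, i):
    """i-th quartile of the sorted list xs, using the position rule c = i*n/4."""
    whole, remainder = divmod(i * len(xs), 4)
    if remainder == 0:                           # c whole: average x_c and x_(c+1)
        return (xs[whole - 1] + xs[whole]) / 2
    return xs[whole]                             # c rounded up (0-based index)


def boxplot_summary(data):
    """Five-number summary (whisker ends exclude outliers) plus the outliers."""
    xs = sorted(data)
    q1, q2, q3 = (quartile(xs, i) for i in (1, 2, 3))
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    inside = [x for x in xs if lower <= x <= upper]
    return {
        "min": inside[0], "Q1": q1, "median": q2, "Q3": q3, "max": inside[-1],
        "outliers": [x for x in xs if x < lower or x > upper],
    }


# Ages as read from the stem and leaf plot in Example 1.12
school_a = [24, 25, 25, 27, 27, 29, 30, 31, 31, 32, 36, 37, 38, 57]
school_b = [22, 33, 34, 36, 37, 40, 41, 43, 44, 45, 47, 51, 53, 54]
print("School A:", boxplot_summary(school_a))
print("School B:", boxplot_summary(school_b))
```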

Information Obtained from a Boxplot
1. If the median is near the centre of the box, the distribution is approximately
symmetric.
2. If the median falls to the left of the centre of the box, the distribution is positively
skewed.
3. If the median falls to the right of the centre of the box, the distribution is
negatively skewed.

 Suppose the median is near the centre of the box (approximately symmetric):
4. If the lines are about the same length, the distribution is approximately
symmetric.
5. If the right line is longer than the left line, the distribution is positively skewed.
6. If the left line is longer than the right line, the distribution is negatively skewed.

 If the boxplots for two or more data sets are graphed on the same axis, the
distributions can be compared using their central tendency (average) and
variability values.
 To compare the averages, use the location of the medians.
 To compare the variability, use the length of the IQR.

EXAMPLE 1.12 solution

Shape:
Based on the location of the median, School A has a right-skewed distribution, where
the teachers’ ages are concentrated at the lower ages (< 30 years old). School B has a
left-skewed distribution, where most of the teachers are older than 42 years.

Average:
Based on the median, 50% of the teachers at School A are younger than 30.5 years,
whereas 50% of the teachers at School B are younger than 42 years. On average, the
teachers at School B are older than the teachers at School A.

Variability:
Based on the IQR, for School A, IQR_A = 9 years, so the middle 50% of the teachers
are aged between 27 and 36 years. For School B, IQR_B = 11 years, so the middle
50% of the teachers are aged between 36 and 47 years. Hence, the variation in teachers’
ages at School B is higher than at School A (IQR_A < IQR_B).

Range:
Excluding the outlier, teachers’ ages at School A vary less (from a minimum of 24 years
to a maximum of 38 years) than at School B (from a minimum of 22 years to a
maximum of 54 years).

Boxplot for Special Case
 In some cases, we cannot use the general guidelines given above to interpret the
boxplot.
 A boxplot is not the best graphical representation to describe a data set if the sample
size is too small.
 The existence of outliers may also affect the boxplot.
 Therefore, in such cases, we have to use the descriptive statistics to identify the
distribution of the data set.

EXERCISE 1.4.2 (Q1)
1. Plot a boxplot for the following data. Then describe the data.

a) 3.2, 5.9, 4.3, 6.9, 4.5, 8.0, 4.7, 8.9, 5.7, 11.9

Answer: Min = 3.2, $Q_1 = 4.5$, $Q_2 = 5.8$, $Q_3 = 8$, no outlier, Max = 11.9, right-skewed

b) 5.8, 9.7, 6.7, 13.4, 6.8, 14.7, 7.2, 16.4, 8.2, 28.1

Answer: Min = 5.8, $Q_1 = 6.8$, $Q_2 = 8.95$, $Q_3 = 14.7$, 28.1 is an outlier, Max = 16.4, right-skewed

[1.4.2 (Q1) solution: boxplot figures not reproduced.]

EXERCISE 1.4.2 (Q2)
2. Two samples of ten springs made out of the steel rods supplied by
two different companies were compared. The measurement of
flexibility (in N/m) for each spring was recorded as follows. Compare
the distributions using boxplots.

Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1

Comment on the flexibility of the springs supplied by the two companies.

Answer:
Company A: Min = 6.7, $Q_1 = 7.3$, $Q_2 = 8.25$, $Q_3 = 8.8$, 4.2 is an outlier, Max = 9.3, left-skewed
Company B: Min = 9.6, $Q_1 = 9.8$, $Q_2 = 10.15$, $Q_3 = 11.0$, no outlier, Max = 11.1, right-skewed

[1.4.2 (Q2) solution: boxplot figure not reproduced.]

EXERCISE 1.4.2 (Q3)
3. The following table presents the viscosity (in Pascal) of a chemical substance from
three (3) batches of a chemical process.

Batches    Viscosity
Batch A    13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B    13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C    13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9

a) Complete the table below by showing all the necessary calculations.

Measures of position   Batch A   Batch B   Batch C
1st quartile           14.30     14.10     ____
Median                 14.55     ____      14.55
3rd quartile           ____      15.40     15.80
Outlier                No        ____      No

b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape and variability.

Answer:
Batch A: $Q_3 = 15.2$, right-skewed
Batch B: $Q_2 = 15.05$, no outlier, left-skewed
Batch C: $Q_1 = 14.1$, right-skewed

1.4.2 (Q3) solution
[Figure: boxplots of viscosity for Batch A, Batch B and Batch C on a common axis running from 12 to 17.]

MIND EXPANDING EXERCISES
15. An experiment was conducted to assess the potency of various constituents of
orchard sprays in repelling honeybees. Individual cells of dry comb were filled
with measured amounts of lime sulphur emulsion in sucrose solution. Seven
different concentrations of lime sulphur, ranging from a concentration of 1/100
to 1/1,562,500 in successive factors of 1/5, were used, as well as a solution
containing no lime sulphur (A, B, C, D, E, F, G, H). The responses for the
different solutions were obtained by releasing 100 bees into the chamber for
two hours, and then measuring the decrease in volume of the solutions in the
various cells. Based on the figure, answer the following questions:
[Figure for ME.15 not reproduced.]
a) Which concentration has outlier(s)?
b) Group the concentration according to their shape of distribution.
c) Which concentration has the most consistent data? Why?
d) Which concentration has the most variable data? Why?
e) H is the concentration of ‘no lime sulphur’. What is the use of
concentration H?
f) What conclusion can you draw from this experiment?

1.5 NORMAL PROBABILITY PLOT

 Draw and interpret a normal probability plot.

Normal Probability Plots
 A normal probability plot is the easiest way to check whether the sample
distribution is normal or not.
 The most plausible normal distribution is the one whose mean and standard deviation
are the same as the sample mean and standard deviation.

STEP 1 : Sort the data in ascending order and denote the sorted data as
$x_i$, $i = 1, \ldots, n$.
STEP 2 : Number the sorted data from 1 to $n$.
STEP 3 : Calculate the probability value for each $x_i$ using $p_i = \dfrac{i - 0.5}{n}$.
STEP 4 : Plot $p_i$ versus $x_i$.

If the sample points lie approximately on a straight line,
the data are approximately normally distributed.
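
Steps 1 to 3 can be sketched in Python as below, using the six values from Exercise 1.5 (question 1). The function name `probability_plot_points` is hypothetical, and the plotting itself (Step 4) is left to any charting tool. The extra column of normal scores, computed with `statistics.NormalDist().inv_cdf`, is an addition that is not part of the four steps above; some textbooks plot it against $x_i$ instead of $p_i$.

```python
from statistics import NormalDist


def probability_plot_points(data):
    """Return (x_i, p_i) pairs with p_i = (i - 0.5) / n for the sorted data."""
    xs = sorted(data)                            # Step 1: ascending order
    n = len(xs)
    # Steps 2 and 3: number the values 1..n and compute p_i for each
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


sample = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]    # data from Exercise 1.5 (1)
points = probability_plot_points(sample)
for x, p in points:
    z = NormalDist().inv_cdf(p)                  # optional normal score (assumption)
    print(f"x = {x:5.2f}   p = {p:.4f}   z = {z:+.3f}")
```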

Testing Normality using Software
 Besides plotting manually, we can obtain the normal probability plot from
software such as SPSS, Minitab and Excel. The normality of the data can also
be tested using the Kolmogorov-Smirnov, Anderson-Darling or Shapiro-Wilk tests.

EXAMPLE 1.13

[Figure for Example 1.13 not reproduced.]

→ The graph of $p_i$ versus $x_i$ in the figure is known as the
normal probability plot. Since the data lie approximately on a
straight line, the data are approximately normally distributed.

EXERCISE 1.5
1. A sample of size six is drawn. The sample, arranged in
increasing order, is
3.01 3.35 4.79 5.96 7.89 9.15
Do these data appear to come from an approximately normal
distribution?

2. The data shown represent the number of movies in America over a
14-year period.
2084 1497 1014 910 899 870 859
848 837 826 815 750 737 637
Do these data appear to come from an approximately normal
distribution?

Answers: 1) yes 2) no

1.5 (Q1) solution
[Figure: normal probability plot of $p_i$ (0 to 1) versus $x_i$ (0 to 10).]

1.5 (Q2) solution
[Figure: normal probability plot of $p_i$ (0 to 1.2) versus $x_i$ (0 to 2500).]

CONCLUSION
• The applications of statistics are many and varied. People encounter
them in everyday life, such as in reading newspapers or magazines,
listening to the radio, or watching television.

• By combining all of the descriptive statistics techniques discussed in
this chapter, the student is now able to collect, organize, summarize
and present data.

Thank You
NEXT: Chapter 2 Sampling Distribution and Confidence Interval

