Chapter 1 Introduction To Statistics PDF
INTRODUCTION TO STATISTICS
Expected Outcomes
Able to define basic terminologies of statistics.
Able to apply the basic steps in the statistical problem-solving
methodology for various applications.
Able to summarise and analyse data using measures of central
tendency, measures of variation and measures of position.
Able to relate the concept of accuracy and precision of data using game
of darts.
Able to conduct exploratory data analysis that includes numerical data
analysis and various graphical displays.
Able to plot and interpret normal probability plot.
SZS2017
CONTENT
1.1 Statistical Terminologies
1.2 Statistical Problem Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.2.1 Accuracy and Precision
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel
1.4 Exploratory Data Analysis
1.4.1 Outliers
1.4.2 Box Plot
1.5 Normal Probability Plot
1.1 STATISTICAL
TERMINOLOGIES
Define the meaning of statistics, population,
sample, parameter, statistic, descriptive statistics
and inferential statistics.
Discuss the importance of statistics in daily lives.
1.1.1 What is Statistics?
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
Ten thousand parents in Malaysia have chosen StemLife as their trusted
stem cell bank.
The death rate from lung cancer was 10 times higher for smokers compared
to nonsmokers.
The average cost of a wedding is nearly RM10,000 in Malaysia.
In Malaysia, the median salary for men with a bachelor’s degree is
RM 30,000 per year, while the median salary for women with a bachelor’s
degree is RM 29,000 per year.
Globally, an estimated 500,000 children under the age of 15 live with Type
1 diabetes.
Women who eat fish once a week are 29% less likely to develop heart disease.
What is Statistics?
The science of conducting studies to collect, organise, summarise,
analyse, present, interpret and draw conclusions from data.
Collection and analysis of data are the most important parts of research
methodology.
Researchers must have a basic knowledge of statistics before starting any
research or study involving data analysis.
Statistics is also used to analyse the results of surveys and as a tool in
scientific research to make decisions based on controlled experiments,
estimation, prediction, and quality control.
1.1.2 Why Do We Need Statistics?
Basic knowledge of statistics is needed in any discipline or field of
research or study (in almost all fields of human endeavour) that involves data
analysis.
The methods of statistics allow researchers to design a valid experiment
and to draw a reliable conclusion or interpretation from the data they
produce and analyse.
Examples:
1.1.3 Population and Sample
Population (N)
A complete collection of measurements, outcomes, objects or individuals under study.
Tangible
Finite: the total number of subjects is fixed and could be listed.
→ all computers in a room, all female students in a university, or all electrical
components manufactured in a day, etc.
Conceptual (Intangible)
All values that might possibly have been observed; has an unlimited number of subjects.
→ simulated data from a computer or instrument, number of germs on the human
body, all experimental data such as all measurements of the length of a metal rod, etc.
Sample (n)
A subset of the population that is observed.
Parameter and Statistic
Parameter
A numerical value that represents a certain population characteristic.
• The average weight of students from a population of students in a university.
• The percentage of defective components in a population of electrical components
manufactured in a day.
Statistic
A numerical value that represents a certain sample characteristic.
• The average weight for a sample of female students selected from all students in
a university.
• The percentage of defective components in a sample of 100 electrical components.
Notation (parameter, statistic)
Mean (Average): $\mu$, $\bar{x}$
Variance: $\sigma^2$, $s^2$
Standard deviation: $\sigma$, $s$
Proportion: $\pi$, $p$
EXAMPLE 1.1
A travel agent claims that the average number of rooms in large hotels in
Pahang is 500 and the standard deviation is 165. A sample of seven hotels in
Genting Highlands is selected and the average number of rooms is found to be
435 with standard deviation of 15.
Based on the above example:
EXERCISE 1.1.3
The number of first year students at a residential college is 317 students. An IQ
pre-test is given to all of them in their first week. The dean of admission
collected data on 27 of them and found their mean score on the IQ pre-test was
51. The mean for the entire first year students was estimated to be
approximately 51. A subsequent computer analysis of all first year students
showed that the true mean (population mean) is 52.
Based on the above statement, answer the following questions.
a) 317 first year students b) tangible c) 27 first year students d) IQ pre-test score e) 52 f) 51
1.1.4 Descriptive and
Inferential Statistics
Descriptive statistics
• Includes the process of data collection, data organisation, data classification,
data summarisation, and data presentation obtained from the sample.
• Used to describe the characteristics of the sample.
• Used to determine whether the sample represents the target population by
comparing the sample statistic and the population parameter.
EXAMPLE:
Ten thousand parents in Malaysia have chosen Takaful Insurance as their
trusted life insurance agency.
Inferential statistics
• Involves a process of generalisation, estimation, hypothesis testing, prediction
and determination of relationships between variables.
• Used to describe, infer, estimate or approximate the characteristics of the target
population.
• Used when we want to draw a conclusion about the population from the data
obtained from the sample.
EXAMPLE:
The death rate from lung cancer was 10 times higher for smokers compared to
nonsmokers.
Overview of descriptive
and inferential statistics
EXERCISE 1.1.4
1.1.5 Role of the Computer in Statistics
1. Spreadsheets
Microsoft Excel & Lotus 1-2-3
2. Statistical Packages
AMOS, eViews, MINITAB, R, SAS, SmartPLS,
SPSS and SPlus
Data Analysis Application Tools in EXCEL
2. Formulas
Choose Analysis ToolPak and click Go.
Tick Analysis ToolPak and click OK.
→ Now we can use the Data Analysis
Application in Microsoft Excel to analyse data.
1.2 STATISTICAL
PROBLEM-SOLVING
METHODOLOGY
Outline the six basic steps in the statistical
problem-solving methodology.
Identify various sampling methods.
Classify type of data and level of measurement.
Statistical Problem-Solving
Methodology
1.2.1 Identify the Problem or Opportunity
The researchers must clearly understand and define the objective of the study
before conducting any research. Possible questions that could be asked before
starting any study are given as follows.
Characteristics of Sample Size
The larger the sample size, the smaller the magnitude of the sampling error.
Studies using the survey method need a larger sample size since participation
in a survey is voluntary.
Studies using mail response need a much larger sample size. Normally,
the response rate is as low as 20%-30%.
The ideal sample size in a study should be large enough to serve as an
adequate representation of the population, so that the results can be
generalised to the overall population.
The optimal sample size depends on the statistical distribution used and on
the purpose of generalising to the whole population.
Researchers may refer to Krejcie and Morgan (1970) as a guideline to
obtain an adequate sample size.
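The Krejcie and Morgan (1970) guideline can be sketched in a few lines. This is a minimal illustration, assuming the commonly quoted settings of that formula (chi-square value 3.841 for one degree of freedom at 95% confidence, population proportion 0.5, margin of error 0.05). The function name and defaults are illustrative and are not part of the slides.

```python
def krejcie_morgan_sample_size(population_size, chi_square=3.841,
                               proportion=0.5, margin=0.05):
    """Suggested sample size for a finite population (Krejcie and Morgan, 1970).

    Assumed defaults: chi-square for 1 degree of freedom at 95% confidence,
    population proportion 0.5 and a 5% margin of error.
    """
    spread = chi_square * proportion * (1 - proportion)
    numerator = spread * population_size
    denominator = margin ** 2 * (population_size - 1) + spread
    return round(numerator / denominator)


# Suggested sample sizes for a few population sizes.
for n_population in (100, 1000, 5000):
    print(n_population, krejcie_morgan_sample_size(n_population))
```

The required sample grows much more slowly than the population, which is why a table of such values is a convenient guideline.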
1.2.2 Deciding on the
Method of Data Collection
To solve the problem, the data collected must be as complete as possible,
accurate and relevant to the problem.
Interviews method
The purpose of interview in collecting data is to find out what is in or on
someone else’s mind.
Interview data can easily become biased and misleading if the interviewed
person is aware of the perspective of the interviewer.
It is very important to make sure the person being interviewed does not
hold any preconceived notions regarding the outcome of the study.
Interviews range from quite informal and completely open-ended to very
formal with the questions predetermined and asked in a standard manner.
Usually, interviews are used to gather information regarding an individual’s
experience and knowledge; his/her opinions, beliefs, and feelings, and
demographic data.
Example: An interviewer who is interested in gathering information on the way
nurses organise their care in hospital wards conducts an interview
session.
Other Methods of Data Collection
• Questionnaires and surveys (Quantitative + Qualitative).
• Opinions (Qualitative + Quantitative).
• Projective technique and psychological tests (both).
• Proxemics – Study of people’s use of space and their relationship to
culture.
• Kinesics – Study of body movement, or how people communicate
nonverbally.
• Street Ethnography – Concentrates on a person becoming a part of
the place under study.
• Narratives – Study of people’s individual life stories.
• Triangulation – The use of multiple data collection techniques
(triangulation of data permits the verification and validation of
qualitative data).
EXERCISE 1.2.2
Identify each of the following studies as being either observational or
experimental.
a) Subjects were randomly assigned to two groups, and one group was
given a herb and the other group a placebo. After 6 months, the
numbers of respiratory tract infections in each group were compared.
b) A researcher stood at a busy intersection to see if the colour of an
automobile a person drives is related to running red lights or not.
c) A researcher finds that people who are more hostile have higher
total cholesterol levels than those who are less hostile.
d) Subjects are randomly assigned to four groups. Each group is
placed on one of four special diets—a low-fat diet, a high-fish diet, a
combination of low-fat diet and high-fish diet, and a regular diet.
After 6 months, the blood pressures of the groups are compared to
see if diet has any effect on blood pressure or not.
1.2.3 Collecting the Data
(Sampling Techniques)
Sampling is the process of selecting a few units (a sample) from a population
to become the basis for estimating or predicting the prevalence of an
unknown piece of information, situation or outcome regarding the
bigger group.
i. Non-probability sampling (judgment, voluntary, convenience):
• The sample is collected based on the judgment of the experimenter.
• Resulting samples might be biased.
ii. Probability sampling (random, systematic, stratified, cluster):
• The chance of each unit being selected is known before the sample is picked.
• Resulting samples are unbiased.
Sampling Techniques (diagram)
Nonprobability sampling: Voluntary, Convenience, Snowball, Quota, Others
Probability sampling: Random, Systematic, Cluster, Stratified, Multi-stage, K-Sampling, Nested, Others
A. Nonprobability Sampling Methods
Judgment sampling
Data is selected based on the opinion of one or more experts.
Example: A political campaign manager intuitively picks certain voting districts as
reliable places to measure the public opinion of his candidates.
Voluntary sampling
Questions are posed to the public by publishing them over radio or television via
phone, short message, email etc. The resulting sample tends to over-represent
individuals who have strong opinions.
Example: A call-in radio show asks its listeners to participate in surveys on
controversial topics such as abortion, affirmative action, gun control, politics, etc.
Convenience sampling
The data selected is an “easy sample”, also called haphazard or accidental sampling.
The researcher obtains units or people who are most conveniently available.
Example: A surveyor stands in one location and asks passersby the questions.
B) Probability Sampling Methods
1. Random sampling
• Each data item is numbered, and then the
sample is selected using a chance or
random method such as random
numbers.
• When a sample is chosen at random,
it is said to be an unbiased sample.
• A random sample can be selected with
or without replacement.
Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her
university. There are 5000 students enrolled at the university, and he/she wants to draw a
sample of size 100 to take a physical fitness test.
She could obtain a list of all 5000 students, number it from 1 to 5000, generate 100
random numbers in that range, and then invite the 100 students corresponding to those
numbers to participate in the study.
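The lecturer's example can be sketched with Python's standard `random` module. This is only an illustration: the function name and the seed are hypothetical, and the student list is represented simply by the numbers 1 to 5000.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw student numbers at random, without replacement."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    return rng.sample(range(1, population_size + 1), sample_size)


# 100 of the 5000 numbered students are invited to the fitness test.
invited = simple_random_sample(5000, 100, seed=2017)
print(sorted(invited)[:10])
```

Here `random.sample` selects without replacement, so no student is invited twice. Sampling with replacement would instead call `rng.choices(...)`.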
Generating Random Number
• Generating random numbers is an important step in obtaining a
random sample.
• In a set of random numbers, each number has an equal chance of being selected.
• Random numbers can be generated from a calculator, software, or a
random number table.
B) Probability Sampling Methods
2. Systematic sampling
• A set of data is numbered from 1 to N: $x_1, x_2, \ldots, x_N$.
• The first data item is selected randomly between
1 and k, where k = N/n and n is the
sample size.
• The subsequent items are selected at every k-th
interval to produce a sample of size n.
Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her university
and wants to draw a sample of size 100 to take a physical fitness test. She obtains a list
of all 5000 students, numbers it from 1 to 5000 and randomly picks one of the first 50 students
(k = 5000/100 = 50) on the list. If the first picked number is 30, then the 30th student in the list
should be invited first. Then she should invite every 50th name on the list after this
random start (the 80th student, the 130th student and so on) to produce a sample of 100
students to participate in the study.
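The systematic procedure above can be sketched as follows. The function name and seed are illustrative assumptions, and the sketch assumes N is an exact multiple of n, as in the example (5000/100 = 50).

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start in 1..k, then take every k-th number."""
    k = population_size // sample_size      # sampling interval, k = N / n
    rng = random.Random(seed)
    start = rng.randint(1, k)               # random start among the first k units
    return [start + i * k for i in range(sample_size)]


picked = systematic_sample(5000, 100, seed=1)
print(picked[:3], "...", picked[-1])
```

Only the starting point is random. Once it is fixed, the rest of the sample is fully determined, which is why the method is described as less random than simple random sampling.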
B) Probability Sampling Methods
3. Stratified sampling
• The population is divided into groups
according to some characteristic that is
important to the study, and then the sample
is selected from each group using random or
systematic sampling.
• The characteristics are homogeneous
(similar) within each group but
heterogeneous (dissimilar) among the groups
Example:
Assume that, because of different lifestyles, the level of physical fitness is different
between male and female students. To account for this variation in lifestyle, the population
of students can easily be stratified into male and female students.
The random method or the systematic method can be used to select the participants within
each stratum. For example, she could use random sampling to choose 50 male students and
systematic sampling to choose another 50 female students, or vice versa.
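A minimal sketch of the stratified example, assuming simple random sampling is used inside every stratum. The roster below is hypothetical (odd student numbers labelled "male", even ones "female") and exists only to make the sketch runnable.

```python
import random


def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw a random sample from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


# Hypothetical roster of (student number, gender) pairs.
roster = [(i, "male" if i % 2 else "female") for i in range(1, 5001)]
by_gender = stratified_sample(roster, lambda student: student[1], 50, seed=7)
print({name: len(members) for name, members in by_gender.items()})
```

Every stratum contributes to the sample, which is what gives stratified sampling its high degree of representativeness.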
B) Probability Sampling Methods
4. Cluster sampling
• The population is divided into groups or
clusters, then some of those clusters are
randomly selected and all members from
those selected clusters are chosen.
• Cluster sampling can reduce cost and time.
• The members within each cluster are
heterogeneous, while the clusters themselves
are homogeneous (similar to one another).
• We can choose more than one cluster.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different
between 1st year, 2nd year, 3rd year and senior students. To account for this variation in
lifestyle, the population of students can easily be clustered into four categories.
Then, she can randomly choose one or more clusters and select all students in those clusters
as the participants. For example, all 2nd year students are chosen as the participants.
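A minimal sketch of cluster sampling for the year-group example. The cluster sizes (1250 students per year) and the student numbering are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical clusters: four year groups of 1250 numbered students each.
year_groups = {
    "1st year": range(1, 1251),
    "2nd year": range(1251, 2501),
    "3rd year": range(2501, 3751),
    "senior": range(3751, 5001),
}
selected = cluster_sample(year_groups, 1, seed=3)
for name, members in selected.items():
    print(name, len(members))
```

Unlike stratified sampling, only the chosen clusters are visited, which is where the savings in cost and time come from.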
Advantages and Disadvantages of Each
Sampling Technique
Judgement Sampling
When to use: When the population is too large.
Advantages:
- Fast and conclusive.
Disadvantages:
- Biased, since it is based on the opinion of one or more experts only.
Voluntary Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast response.
- Easy to obtain larger sample sizes.
Disadvantages:
- Samplings are too random.
- Sometimes not reliable.
- Degree of generalisability is questionable.
Convenience Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast and easy.
- Convenient and inexpensive.
Disadvantages:
- Samplings are too random.
- Sometimes not reliable.
- Degree of generalisability is questionable.
Random Sampling
When to use: When the members of the population are similar to one another on
important variables.
Advantages:
- Uses a table of random numbers.
- Each data item has an equal chance of being selected.
- Ensures a high degree of representativeness.
Disadvantages:
- High cost.
- Time consuming for a large sample size.
- Tedious.
Systematic Sampling
When to use: When the members of the population are similar to one another on
important variables.
Advantages:
- Relatively easy to construct, execute, compare and understand.
- The process can be controlled.
- Good for research on a tight budget.
- Ensures a high degree of representativeness.
- No need to use a table of random numbers.
Disadvantages:
- There is a risk of data manipulation.
- Not the best method if the researcher does not know the background of the
population.
- Less random than simple random sampling.
Stratified Sampling
When to use: When the population is heterogeneous and contains several different
groups, some of which are related to the topic of the study.
Advantages:
- Variety of samples.
- Ensures a high degree of representativeness of all the strata or layers in the
population.
Disadvantages:
- Time consuming.
- Tedious.
Cluster Sampling
When to use: When the population consists of units rather than individuals.
Advantages:
- Less energy and money.
- Easy and convenient.
- Saves time.
Disadvantages:
- Members of units may be different from one another, decreasing the
technique's effectiveness.
Random Data Generation
From Normal Distribution
$X \sim N(\mu, \sigma^2)$ or $Z \sim N(0, 1)$
$\mu$ is the mean
$\sigma^2$ is the variance
Random Data Generation
From Poisson Distribution
$X \sim Po(\lambda)$, where $\lambda$ is the average value
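Outside a spreadsheet, random data from these two distributions can be generated with a short script. The sketch below uses only Python's standard library: `random.gauss` for the normal distribution and Knuth's multiplication method for the Poisson distribution. The parameter values, the function names and the seed are illustrative assumptions. Note that `random.gauss` takes the standard deviation $\sigma$, not the variance $\sigma^2$.

```python
import math
import random


def normal_data(mu, sigma, size, rng):
    """Draw `size` values from N(mu, sigma^2); sigma is the standard deviation."""
    return [rng.gauss(mu, sigma) for _ in range(size)]


def poisson_value(lam, rng):
    """Draw one Po(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def poisson_data(lam, size, rng):
    return [poisson_value(lam, rng) for _ in range(size)]


rng = random.Random(2017)                 # fixed seed for a repeatable illustration
normal_sample = normal_data(50.0, 5.0, 10000, rng)
poisson_sample = poisson_data(4.0, 10000, rng)
print(sum(normal_sample) / len(normal_sample))
print(sum(poisson_sample) / len(poisson_sample))
```

With a large sample, the two printed averages should lie close to the chosen $\mu$ and $\lambda$.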
EXERCISE 1.2.3
In each of these statements, identify the type of sampling method used.
1.2.4 Classifying and Summarising
the Data
In this step, the collected data are organised properly for further study and
investigation.
Data that has been collected during the sampling process is called raw data.
The simplest way to organise raw data systematically is by using data array.
Data array is an arrangement of data items in either ascending or
descending order (sorting).
1.2.4.1 Classifying
Identify items with the same characteristics and arrange them into
groups or classes.
Data can be classified by type or by level of measurement.
1.2.4.2 Summarisation
Graphical and descriptive statistics (tables, charts, measures of central
tendency, measures of variation, measures of position).
Example of Raw Data
1.2.4.1 Data Classification
Data can be
classified
Type of Data
Qualitative (Categorical/Attributes)
• Data that refers to a classification name according to some characteristic or attribute.
• Data is classified using code numbers (1, 2, …).
Nominal Data
The values cannot be ranked.
Gender, race, citizenship, colour, etc.
Ordinal Data
The values can be ranked and a Likert scale is used.
Feeling (dislike-like), colour (dark-bright), etc.
Quantitative (Numerical)
• Data can be counted or measured.
• Data can be ordered or ranked.
Discrete Data
The values can be counted and are finite.
Number of students, number of cats, number of defects, etc.
Continuous Data
The values can be placed within two specified values, are obtained by measuring,
have boundaries, and shall be rounded to the required decimal places.
Weight, age, salary, temperature, etc.
Levels of Measurement of Data
Nominal-level
Description: Classifies data into mutually exclusive (non-overlapping), exhaustive
categories in which no order or ranking can be imposed on the data.
Examples: Zip code (4, 5, 6, …), Post code (25000, 25600, …), Gender (female, male),
Eye colour (blue, brown, green, hazel), Political affiliation, Religion, Nationality.
Ordinal-level
Description: Classifies data into categories that can be ranked; however, precise
differences between the ranks do not exist.
Examples: Grade (A, B, C, D, etc.), Judging (first place, second place, etc.),
Rating scale (poor, good, excellent), Colour (light blue, …, dark blue).
Interval-level
Description: Ranks the data, and precise differences between units of measure do
exist; however, there is no meaningful zero.
Examples: IQ test, Temperature, Shoe size.
Ratio-level
Description: Possesses all the characteristics of interval measurement, and there
exists a true zero.
Examples: Height, Weight, Time, Salary.
EXERCISE 1.2.4.1
1. The SuperMotor Marketing Corporation has asked you for information
about the car you drive. For each question, identify the type of data
requested as either attribute data or numeric data. When attribute data is
requested, identify the variable as either nominal or ordinal. When
numeric data is requested, identify the variable as either discrete or
continuous. Then, identify the level of measurement for each variable.
EXERCISE 1.2.4.1
2. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.
1.2.4.2 Data Summarisation
1) Descriptive statistics (refer Section 1.3)
Typically used to confirm conjectures about the data.
Quantitative data: measures of central tendency, measures of
variation (dispersion) and measures of position.
Qualitative data (non-numeric quality (attribute) or category):
measure the relative frequency for a particular characteristic
and calculate its percentage.
2) Graphical Summary
Organise the data in some meaningful way by constructing a
frequency distribution (refer Appendix A.1) for quantitative or
qualitative data.
A frequency distribution is the organisation of raw data in
table form, using classes and frequencies.
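Both kinds of frequency distribution can be sketched briefly. The data sets below are hypothetical, and the class limits (lower limit 10, class width 10) are assumptions chosen only for illustration. Each class includes its lower limit and excludes its upper limit.

```python
from collections import Counter


def relative_frequencies(values):
    """Qualitative data: count each category and express it as a percentage."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, 100 * count / total)
            for category, count in counts.items()}


def grouped_frequencies(values, lower, width, n_classes):
    """Quantitative data: count the values falling in each class [low, high)."""
    table = [0] * n_classes
    for value in values:
        index = int((value - lower) // width)
        if 0 <= index < n_classes:
            table[index] += 1
    return [(lower + i * width, lower + (i + 1) * width, table[i])
            for i in range(n_classes)]


colours = ["red", "blue", "red", "green", "red", "blue"]       # hypothetical
scores = [12, 15, 21, 22, 29, 30, 34, 38, 41, 47]              # hypothetical
print(relative_frequencies(colours))
print(grouped_frequencies(scores, lower=10, width=10, n_classes=4))
```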
Graphical Statistics
The purpose of graphs in statistics is to convey the data to the viewer in pictorial
form and to get the audience’s attention in a publication or a presentation.
Bar Chart, Pareto Chart,
Pie Chart
BASIC INFERENTIAL STATISTICS
Statistical analyses and their characteristics:
Confidence Intervals (CHAPTER 2)
An estimated range of values which is likely to include an unknown population
parameter, $\theta$, with a specified probability (confidence level).
The interval is usually written as $(a, b)$ or $a < \theta < b$.
Hypothesis Testing (CHAPTER 3)
A statement (claim, conjecture or assertion) concerning a parameter or
parameters of one or more populations.
• Statistical analysis for one population (mean, variance, proportion)
• Statistical analysis for two populations (mean, variance, proportion)
Analysis of Variance (ANOVA) (CHAPTER 4)
Statistical analysis for the means of three or more populations.
• One-way ANOVA
• Two-way ANOVA and Post Hoc Test
Linear Regression Analysis (CHAPTER 5)
A statistical measure that attempts to determine the strength of the relationship
between dependent (y) and independent variables (x).
• Simple linear regression analysis and correlation (y vs x).
• Multiple linear regression analysis and correlation (y vs xi).
• Model selection technique to choose a parsimonious model that best fits the data.
Statistical Analysis for Categorical Data (CHAPTER 6)
1. Tests concerning frequency distributions for categorical data (Goodness of Fit)
2. Tests concerning specific probability distributions (Goodness of Fit)
3. Test of the independence of two variables (Contingency Table)
4. Test of the homogeneity of proportions (Contingency Table)
ADVANCED INFERENTIAL STATISTICS
Statistical analyses and their characteristics:
Experimental Design (DOE)
Planning, conducting, analysing and interpreting controlled tests to evaluate the factors
that control the value of a parameter or group of parameters.
Example: ANOVA, Single factor experiment, Randomized Blocks, Latin Squares and
Related Design, Factorial Design, Response Surface Methodology, Nested and Split-Plot
Design
Time Series Analysis
Modelling, making inference and producing forecasts from time series data for future
observations. Time series models are built to represent serially correlated series,
trends, or seasonal effects.
Example: Linear Time Series, Linear Stationary Models (AR, MA, ARMA), Linear
Nonstationary Models (ARIMA, SARMA), Box-Jenkins Models, Volatility Models (ARCH,
GARCH), Hybrid models
Multivariate Analysis
A central tool whenever many variables need to be considered at the same time.
Example: Mean Vector and Covariance Matrix Estimation, MANOVA, Principal
Component Analysis, Factor Analysis, Canonical Correlation Analysis, Discriminant
Analysis, Cluster Analysis
Statistical Quality Control (SQC)
Quality improvement through the use of modern statistical methods for quality control.
Example: Variables Control Charts, Attribute Control Charts, Time-Weighted Control
Charts, Multivariate Control Charts
Statistical Modelling
Mathematical equations that relate one or more random variables and possibly
other non-random variables, concerning the generation of some sample data and
similar data from a larger population.
• Examples of statistical models: Generalised Linear Model, Dependence model,
Regression, Bayesian, Markov chain, Random effect and mixed model
• The process involves: parameter estimation, data generation, missing values,
outlier detection, simulation study, bootstrap, goodness of fit test
Data Mining
A computing process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems.
Example: Decision Tables, Decision Trees, Classification Rules, Association Rules,
Clustering, Advanced linear model, Bayesian, Instance-based Learning
Circular Statistics
A branch of statistics that involves circular data, which deal with direction or cyclic
time. Circular data are measured in degrees (0°, 360°] or radians (0, 2π].
Example: orientation of an animal, direction of wind and wave, days of the week,
compass direction, waves of sound, the human perception under various conditions,
the orientation of ridges of fingerprints, the orientation of sand grains from a beach,
the death due to a disease at various times in a year, and astronomical observations.
Advanced Regression Analysis
• Polynomial Regression: y is modelled as an nth degree polynomial in x.
• Multivariate Regression: Y is a matrix with a series of multivariate dependent
measurements and X is a matrix of observations on independent variables.
• Generalized Linear Model: A flexible generalization of ordinary linear
regression that allows for response variables that have error distribution
models other than a normal distribution.
• Logistic Regression: A regression model where the dependent variable is
categorical.
• Nonlinear Regression: The observational data are modelled by a function
which is a nonlinear combination of the model parameters and depends on
one or more independent variables.
• Error in Variables: A regression model that accounts for measurement errors
in the independent variables.
1.2.6 Make the decision
and conclusion
The researchers can then make decisions in order to achieve the
objective and goal of the research, choosing the option
that represents the ‘best’ solution to the problem.
1.3 REVIEWS ON
DESCRIPTIVE
STATISTICS
Summarise the data using measures of central
tendency, such as the mean, median, mode, and
midrange.
Describe the data using measures of variation, such
as the range, variance, standard deviation and
coefficient of variation.
Identify the position of a data value in a data set
using measures of position such as quartiles, deciles,
and percentiles.
RULE OF THUMB FOR DECIMAL
PLACES
2. If the unit is given (in cm, minute, day, etc.), the value should
be rounded to that unit’s decimal places.
TIPS: Descriptive Statistics using
Scientific Calculator
Casio fx-570MS
Casio fx-570ES
Midrange (MR)
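Is a rough estimate of the middle, found by adding the lowest and highest values and dividing by 2:
$MR = \frac{\text{lowest value} + \text{highest value}}{2}$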
Properties of Midrange
A rough estimate of the average
Can be affected by one extremely high or low value (outlier).
Mean
Is the sum of the values divided by the total number of values.
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, N = population size
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, n = sample size
If the data set is 1, 3, 5, 7, 7, 8, then
- the calculated mean is $\mu = 5.1667$ if the data is taken from the population.
The value is a true mean or a parameter.
- the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
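Median (MD)
With the data arranged in ascending order:
If n is odd: $MD = x_{\frac{n+1}{2}}$
If n is even: $MD = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
If the data set is 1, 3, 5, 7, 7, 8, then the calculated median is
$MD = \frac{x_3 + x_4}{2} = \frac{5 + 7}{2} = 6$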
Mode
Is the most commonly occurring value in a data series
EXAMPLE 1.4:
a) If the data set is 1, 6, 3, 7, 8, 5, then the mode does not exist.
b) If the data set is 1, 6, 3, 7, 8, 3, 5, then the mode is 3.
c) If the data set is 1, 6, 3, 7, 3, 8, 7, 5, 3, 7, then the modes are 3 and 7.
Properties of Mode
The mode is used when the most typical case is desired.
The mode can be used when the data are nominal.
The mode is not always unique.
A data set can have more than one mode, or the mode may not
exist for a data set.
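The measures of central tendency in this section can be checked with a short sketch. It uses Python's standard `statistics` module for the mean and median. The `midrange` and `modes` helpers are illustrative names written for this sketch, with `modes` following the convention used here that a data set in which no value repeats has no mode.

```python
from collections import Counter
from statistics import mean, median


def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2


def modes(values):
    """All most frequent values; an empty list when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


data = [1, 3, 5, 7, 7, 8]
print("mean     :", round(mean(data), 4))
print("median   :", median(data))
print("midrange :", midrange(data))
print("mode(s)  :", modes(data))
```

Applied to the three data sets of Example 1.4, `modes` returns an empty list, one mode and two modes respectively.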
Identify the Shapes of Data
Distribution
Figure panels: Symmetric; Positively skewed / right-skewed; Negatively skewed / left-skewed.
Each panel marks the positions of the mean, median and mode.
EXAMPLE 1.5
An extreme value, say 21, is added to the data set in Example 1.3. The new
data set is 1, 3, 5, 7, 7, 8, 21. Assume that the data is taken from a sample, then
a) symmetric b) right-skewed c) left-skewed d) Mean = 12.88, Median = 13.05, mode = 13.3, left-skewed
EXERCISE 1.3.1
2. The following set of data represents the number of hospitals
for selected countries.
123 108 195 138 115 179 119 148 147 180
146 178 189 108 193 114 179 147 108 128
164 174 128 159 193 175
1.3.2 Measures of Variation/Dispersion
Measures of variation or measures of dispersion are measures
that determine the spread of data values.
1. Range: the simplest measure of variation
2. Variance, and
3. Standard deviation: more meaningful and popular measures that describe the
variability of data
4. Coefficient of Variation
Range (R)
Is the difference between the highest value and the lowest value in a
data set.
EXAMPLE 1.6:
Suppose the data set is 1, 6, 3, 7, 8, 5, then the calculated range is $R = 8 - 1 = 7$.
Properties of Range
The simplest measure of variation.
Easily affected by one extremely high or low value (outliers).
Variance
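Is the average of the squares of the distance each value is from the mean.
Standard Deviation
Is the square root of the variance.
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, N = population size
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, n = sample size
The corresponding variances are $\sigma^2$ and $s^2$.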
Properties of Variance & Standard Deviation
The variance is the average of the squares of the distance each value
is from the mean.
If the data values are near the mean, the variance will be smaller.
If the data values are far from the mean, the variance will be larger.
The squared distance is used since the sum of the distances (deviations
from the mean) will always be zero.
Variance is never negative.
The unit of the variance is the square of the unit of the data.
Standard deviation is the square root of the variance.
Standard deviation is a measure of the deviations of values from the
mean.
Standard deviation is never negative.
The unit of the standard deviation is the same as the unit of the data.
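The definitions above can be checked numerically. The following is a minimal Python sketch, not part of the original slides, that computes the range, variance and standard deviation for the data set of Example 1.6. The function names `variance` and `standard_deviation` and the `sample` flag are illustrative choices.

```python
import math

def variance(values, sample=True):
    """Average squared distance from the mean.

    Divides by n - 1 for a sample and by N for a population.
    """
    n = len(values)
    mean = sum(values) / n
    squared_distances = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_distances) / divisor

def standard_deviation(values, sample=True):
    """Square root of the variance, in the same unit as the data."""
    return math.sqrt(variance(values, sample))

data = [1, 6, 3, 7, 8, 5]            # data set from Example 1.6
data_range = max(data) - min(data)   # highest value minus lowest value

print("range:", data_range)
print("sample variance:", round(variance(data), 4))
print("population variance:", round(variance(data, sample=False), 4))
print("sample standard deviation:", round(standard_deviation(data), 4))
```

The sample variance is larger than the population variance for the same values because the sum of squared distances is divided by n - 1 instead of N.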
Coefficient of Variation
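Is the standard deviation divided by the mean: $CVar = \dfrac{s}{\bar{x}} \times 100\%$ for a sample, or $CVar = \dfrac{\sigma}{\mu} \times 100\%$ for a population.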
Properties of CVar
The result is expressed as percentage.
A parameter/statistic that allows user to compare the standard deviations
when the units are different (the variables are different).
RECALL: Descriptive Statistics using
Scientific Calculator
Casio fx-570MS
Casio fx-570ES
Why We Need Measures of Variation
• Measures of variation help us judge how well the
measures of average represent the data.
• They are called measures of variation because they measure the
variability that exists in a data set.
• They can be used when the measures of central tendency do not give
any significant meaning or are not needed/practical.
EXAMPLE:
Suppose we wish to compare the performance of two groups of students
in a test, and the mean values are the same for both data sets.
From the means alone, you might conclude that the two groups
performed equally well in the test. However, if the data sets are
examined graphically as shown in Figure 1.10, a different conclusion
might be drawn.
Examining Data Sets Graphically
EXAMPLE 1.7
The following data represent the ages (in years) of lecturers in two faculties at UMP.
FIST: 24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45
FKEE: 22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53
For these sample data sets, find the standard deviations. Then, identify which data set
is more consistent and less dispersed. What can you say about the variation of age for
lecturers in both faculties?
Solution:
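A quick numerical check of Example 1.7 can be sketched in Python as follows. This is an illustrative sketch, not the worked solution from the slides. It uses `statistics.stdev`, which divides by n - 1 as required for sample data; the variable names are assumptions.

```python
from statistics import mean, stdev

# Ages (in years) of lecturers, copied from Example 1.7
fist = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]
fkee = [22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53]

# Sample standard deviations (divisor n - 1)
s_fist = stdev(fist)
s_fkee = stdev(fkee)

# The data set with the smaller standard deviation is the more consistent one
more_consistent = "FIST" if s_fist < s_fkee else "FKEE"

print(f"FIST: mean = {mean(fist):.2f}, s = {s_fist:.2f}")
print(f"FKEE: mean = {mean(fkee):.2f}, s = {s_fkee:.2f}")
print("More consistent (less dispersed):", more_consistent)
```

Under these assumptions the standard deviations come out at about 7.47 years for FIST and 9.95 years for FKEE, so the ages of FIST lecturers are less dispersed.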
Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
$s_A = 3.6742$, $s_B = 7.8493$
A: 4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3
B: 9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1
$s_A = 1.5$ hours, $s_B = 0.6$ hours
Comparing Two Data Sets with
different units/variable
If the two samples do not have the same units of measurement or the
variables are different, the variance and standard deviation for each
sample cannot be compared directly.
Hence, the best way to compare the variability within these two
variables is by using the coefficient of variation.
$CVar_{age} = 12.90\%$, $CVar_{income} = 17.63\%$
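The reason the coefficient of variation suits this comparison is that it does not depend on the unit of measurement. The sketch below is an illustration only: it reuses the FIST ages from Example 1.7 and expresses them in years and in months. The function name `coefficient_of_variation` is a hypothetical helper, not something defined in the slides.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(values) / mean(values) * 100

# FIST lecturer ages from Example 1.7, in years and converted to months
ages_years = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]
ages_months = [12 * age for age in ages_years]

sd_years = stdev(ages_years)
sd_months = stdev(ages_months)      # 12 times larger: depends on the unit
cv_years = coefficient_of_variation(ages_years)
cv_months = coefficient_of_variation(ages_months)  # unchanged by the unit

print(f"s = {sd_years:.2f} years, s = {sd_months:.2f} months")
print(f"CVar = {cv_years:.2f}% (years), CVar = {cv_months:.2f}% (months)")
```

The standard deviation changes with the unit while the coefficient of variation stays the same, which is why two variables with different units are compared through their coefficients of variation.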
Other Properties of Standard Deviation
Used to determine the number of data values that fall within a
specified interval in a distribution.
a) If you are playing football and you always hit the left goal post
instead of scoring.
b) A candy manufacturer claims that each packet contains 20 candies.
A sample of packets has 18, 21, 19, 21, 19, 20, 22 candies,
respectively. The average is 20 candies with an error of 1 candy.
c) A manufacturer claims that each chocolate packet contains 20
chocolates. A sample of packets have 17, 18, 18, 17, 18, 17, 17
chocolates, respectively.
d) In an experiment, with five trials, the end results of the five trials for
whatever is being tested are: 35 kg, 36 kg, 36 kg, 35 kg, 36 kg. The
actual value (as found in a scientific data book) is meant to be 42 kg.
e) In an experiment, with five trials, the average value is 35 kg. The
actual value (as found in a scientific data book) is meant to be 35 kg.
MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures
of the “centre” of a data set?
7. A JDT football fan records the number on the jersey of each player
in a game. Does it make sense to calculate the mean of those
numbers? Why or why not?
MIND EXPANDING EXERCISES
8. In an analysis of the accuracy of weather forecasts, the actual high
temperatures are compared to the high temperatures predicted one day earlier
and the temperatures predicted five days earlier. Listed below are the errors
between the predicted temperatures and the actual high temperatures for 14
consecutive days in Kuala Lumpur.
Actual high ‒ 2 2 0 0 ‒ 3 ‒2 1
High predicted one day earlier ‒2 8 1 0 ‒ 1 0 1
Actual high ‒ 0 ‒3 2 5 ‒ 6 ‒9 4
High predicted five days earlier ‒1 6 ‒2 ‒2 ‒ 1 6 ‒4
a) Do the means and medians of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
b) Do the standard deviations of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
ME.8 (solution)
Mean median sd
1.5000 1.0000 2.4152
3.8333 4.5000 2.4014
MIND EXPANDING EXERCISES
9. A data set consists of 20 values that are fairly close together. Another
value is included, but this new value is an outlier (very far away from
the other values). How is the standard deviation affected by the
outlier? No effect? A small effect? Or a large effect?
11. When designing the production procedure for batteries used in heart
pacemakers, an engineer specifies that “the batteries must have a
mean life greater than 10 years, and the standard deviation of the
battery life can be ignored.” If the mean battery life is greater than 10
years, can the standard deviation be ignored? Why or why not?
1.3.3 Measures of Position
Describe where a specific data value falls within the data set or its
relative position based on percentiles, deciles and quartiles in
comparison with other data values
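Describing the position of the data value (data in increasing order):
$$P_i = x_{\left(\frac{in}{100}\right)} = x_c, \qquad D_i = x_{\left(\frac{in}{10}\right)} = x_c, \qquad Q_i = x_{\left(\frac{in}{4}\right)} = x_c$$
If the computed position is not a whole number, it is rounded up to the next whole number $c$. If it is a whole number $c$, the value is taken as the average of $x_c$ and $x_{c+1}$.
Quartiles | Percentiles (data set with $n = 11$)
$Q_1 = x_{\frac{1(11)}{4}} = x_{2.75} \Rightarrow x_3 = 27$ | $P_{25} = x_{\frac{25(11)}{100}} = x_{2.75} \Rightarrow x_3 = 27$
$Q_2 = x_{\frac{2(11)}{4}} = x_{5.50} \Rightarrow x_6 = 36$ | $P_{50} = x_{\frac{50(11)}{100}} = x_{5.50} \Rightarrow x_6 = 36$
$Q_3 = x_{\frac{3(11)}{4}} = x_{8.25} \Rightarrow x_9 = 42$ | $P_{75} = x_{\frac{75(11)}{100}} = x_{8.25} \Rightarrow x_9 = 42$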
EXAMPLE 1.9
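Deciles | Percentiles
$D_3 = x_{\frac{3(11)}{10}} = x_{3.3} \Rightarrow x_4 = 30$ | $P_{30} = x_{\frac{30(11)}{100}} = x_{3.3} \Rightarrow x_4 = 30$
$D_5 = x_{\frac{5(11)}{10}} = x_{5.5} \Rightarrow x_6 = 36$ | $P_{50} = x_{\frac{50(11)}{100}} = x_{5.5} \Rightarrow x_6 = 36$
$D_7 = x_{\frac{7(11)}{10}} = x_{7.7} \Rightarrow x_8 = 40$ | $P_{70} = x_{\frac{70(11)}{100}} = x_{7.7} \Rightarrow x_8 = 40$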
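The position rule used in these worked examples can be sketched in Python. The rule coded here is inferred from the examples in this chapter (round a fractional position up; average two neighbouring values when the position is a whole number), so treat it as an assumption rather than a definitive statement of the method. The function name `position_value` is hypothetical. Statistical software often uses different interpolation rules and may give slightly different answers.

```python
def position_value(data, i, parts):
    """Value at position i*n/parts in the sorted data.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    A fractional position is rounded up; a whole-number position c
    returns the average of x_c and x_(c+1).
    """
    if not 0 < i < parts:
        raise ValueError("i must lie strictly between 0 and parts")
    xs = sorted(data)
    n = len(xs)
    if (i * n) % parts == 0:
        c = i * n // parts
        return (xs[c - 1] + xs[c]) / 2       # lists are 0-indexed
    c = (i * n + parts - 1) // parts          # integer ceiling of i*n/parts
    return xs[c - 1]

# Data from Exercise 1.3.3 (n = 9)
exercise = [9, 2, 1, 4, 3, 7, 5, 4, 6]
print("Q1, Q2, Q3:", [position_value(exercise, i, 4) for i in (1, 2, 3)])

# Data from Example 1.11 (n = 8): whole-number positions are averaged
credits = [9, 12, 15, 27, 33, 45, 63, 72]
print("Q1, Q3:", position_value(credits, 1, 4), position_value(credits, 3, 4))
```

Integer arithmetic is used for the position so that the whole-number test does not depend on floating-point rounding.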
EXERCISE 1.3.3
1. Given a set of data as 9 2 1 4 3 7 5 4 6 .
1) 4, 6 2) 8, 15.5
Why We Need Measures of Position?
Percentiles are measures of position that are often used in
educational and health-related fields to indicate the position
of an individual in a group.
A percentile is not a percentage value. The $i$th percentile, $P_i$, is a
value such that $i$% of the data are less than or equal to $P_i$ and
$(100-i)$% are greater than or equal to $P_i$.
EXAMPLE:
If a student obtains 82 marks out of 100 in a test, his/her
score is 82%. However, this gives no indication of his/her
position with respect to the rest of the class. On the other hand,
if his/her score corresponds to the 75th percentile, then he/she
did better than 75% of the students in his/her class.
Why We need Measures of Position?
Quartiles can be used as a rough measurement of variability.
1.3.4 Descriptive Statistics
Using Microsoft Excel
Interpreting Descriptive Statistics
Using Microsoft Excel (Example 1.9)
A firm is conducting a study to compare two different physical
arrangements of its assembly line. The arrangement with the smaller
variance in the number of finished units produced per day will be adopted
as the new arrangement of its assembly line.
→ $\bar{x}_1 < \bar{x}_2$: on average, Assembly Line 2 produced more
finished units per day.
MIND EXPANDING EXERCISES
ME.12
MIND EXPANDING EXERCISES
13. A study is conducted to compare the performance of male and female
students in the statistics course for final examination scores. The
data, descriptive statistics and graph of the final examination scores
are presented as follow. Based on the analysis, answer the following
questions:
Female: 72 62 83 65 60 74 66 68 57 63 61 76 60 78 34 70 59 63 86 43 90 87
Male: 58 81 86 68 70 77 54 54 72 41 33 52 70 37 67 39 74 32 8 33 27 23 54
MIND EXPANDING EXERCISES
ME.13
a) State the mean and standard deviation for both groups and give your
comment.
b) Based on the graph shown, give your comment.
MIND EXPANDING EXERCISES
14. People with diabetes must monitor and control their blood glucose level. The
goal is to maintain fasting plasma glucose between 90 and 130 mg/dl. The
data presented below give the fasting plasma glucose for two groups, before
treatment and after treatment. Answer the following questions:
MIND EXPANDING EXERCISES
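ME.14
Before | Stem | After
8 | 7 |
 | 8 |
6 5 | 9 |
3 | 10 |
2 | 11 |
 | 12 | 8 8
4 | 13 |
7 5 8 1 | 14 |
3 8 | 15 | 8 9
 | 16 | 3 4 0
2 2 | 17 |
 | 18 | 8
 | 19 | 5 8
0 | 20 |
 | 21 |
 | 22 | 7 6 3 1 0
 | 23 |
 | 24 |
5 | 25 |
 | 26 |
1 | 27 |
 | 28 | 3
 | 29 |
 | 30 |
 | 31 |
 | 32 |
 | 33 |
 | 34 |
9 | 35 |
Key: 14|1=141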
1.4 EXPLORATORY
DATA ANALYSIS
Identify outliers.
Draw and interpret a boxplot.
Exploratory Data Analysis
Traditional Method | Exploratory Data Analysis
Frequency distribution | Stem and leaf plot
Histogram | Boxplot
Mean | Median
Standard deviation | Interquartile range (IQR = Q3 - Q1)
The purpose of exploratory data analysis is to discover any gaps or
patterns in the data.
For symmetric data, the appropriate measure of central tendency
is mean and for variability is standard deviation or variance.
For skewed data, the appropriate measure of central tendency is
median and for measure of variability is interquartile range (IQR).
RECALL: Selection of appropriate
statistical techniques for data
summarisation
Type of Data | Descriptive Statistics | Graphical Summary
Quantitative (ratio scale) | Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 - Q1) | Histogram, Bar Chart (bar representing means), stem and leaf plot, Boxplot
Symmetrical Distribution | Mean, Median, Mode, Range, Standard Deviation | Histogram, Bar Chart (bar representing means)
EXAMPLE 1.11
The number of credits in business courses for eight job applicants is
shown here:
9, 12, 15, 27, 33, 45, 63, 72.
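Find the first and third quartiles for the above data. Is there any
outlier in the above data?
$Q_1 = x_{\frac{1(8)}{4}} = x_2 \Rightarrow Q_1 = \dfrac{x_2 + x_3}{2} = 13.5$
$Q_3 = x_{\frac{3(8)}{4}} = x_6 \Rightarrow Q_3 = \dfrac{x_6 + x_7}{2} = 54$
$Q_1 = 3$, $Q_3 = 6$; 19 is an outlier
$Q_1 = 5$, $Q_3 = 11$; 21 is an outlier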
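The outlier question in Example 1.11 can be checked with a short Python sketch. It assumes the 1.5 × IQR rule that Example 1.12 applies, and it takes the quartiles 13.5 and 54 from the working in Example 1.11. The function name `outlier_fences` and the parameter `k` are illustrative.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

credits = [9, 12, 15, 27, 33, 45, 63, 72]   # data from Example 1.11
q1, q3 = 13.5, 54                           # quartiles found in Example 1.11

lower, upper = outlier_fences(q1, q3)
outliers = [x for x in credits if x < lower or x > upper]

print("IQR:", q3 - q1)
print("limits:", lower, upper)
print("outliers:", outliers)
```

Every value lies between the two limits, so under this rule the data set has no outlier.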
1.4.2 Boxplots
A boxplot (box-and-whisker plot) is a graphical representation of the five-number summary of a data set, together with any outliers.
Five-number summary:
The lowest value of data set (minimum)
The lower quartile Q1 (1st Quartile or 25th percentile)
The median (2nd Quartile or 50th percentile)
The upper quartile Q3 (3rd Quartile or 75th percentile)
The highest value of data set (maximum)
+ Outliers
Types of Boxplots
A Horizontal boxplot
A Vertical boxplot
EXAMPLE 1.12
The following mixture (back-to-back) stem and leaf plot represents a sample of the ages of teachers in two
schools.
School A | Stem | School B
9 7 7 5 5 4 | 2 | 2
8 7 6 2 1 1 0 | 3 | 3 4 6 7
 | 4 | 0 1 3 4 5 7
7 | 5 | 1 3 4
[key: 3|4 → 34]
Given that for School B, $Q_1 = 36$, $Q_2 = 42$, $Q_3 = 47$ and there is no outlier. Draw boxplots
for both schools on the same x-axis. Then compare the shapes, averages, and variability of
both age distributions.
Measure | School A | School B
Minimum | 24 | 22
1st quartile | $Q_1 = x_{\frac{1(14)}{4}} = x_{3.5} \Rightarrow x_4 = 27$ | $Q_1 = 36$
2nd quartile/Median | $Q_2 = \dfrac{x_7 + x_8}{2} = 30.5$ | $Q_2 = 42$
3rd quartile | $Q_3 = x_{\frac{3(14)}{4}} = x_{10.5} \Rightarrow x_{11} = 36$ | $Q_3 = 47$
Maximum | 38 | 54
Outliers | $Q_1 - 1.5(Q_3 - Q_1) = 27 - 1.5(36 - 27) = 13.5$; $Q_3 + 1.5(Q_3 - Q_1) = 36 + 1.5(36 - 27) = 49.5$. Since 57 > 49.5, 57 is an outlier. | no outlier
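The summary values in Example 1.12 can be reproduced with the Python sketch below. The two age lists are read off the stem and leaf plot, which is an assumption about how the leaves split between the two schools. The quartile rule is the position rule used in the worked examples of this chapter, and the function names `quartile` and `boxplot_summary` are hypothetical.

```python
def quartile(xs, i):
    """i-th quartile of the sorted list xs using the position i*n/4."""
    n = len(xs)
    if (i * n) % 4 == 0:
        c = i * n // 4
        return (xs[c - 1] + xs[c]) / 2   # whole position: average neighbours
    c = -(-i * n // 4)                   # fractional position: round up
    return xs[c - 1]

def boxplot_summary(data, k=1.5):
    """Five-number summary (outliers excluded from min/max) plus outliers."""
    xs = sorted(data)
    q1, q2, q3 = (quartile(xs, i) for i in (1, 2, 3))
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    inside = [x for x in xs if lower <= x <= upper]
    return {
        "min": inside[0], "q1": q1, "median": q2, "q3": q3,
        "max": inside[-1],
        "outliers": [x for x in xs if x < lower or x > upper],
    }

# Ages read from the stem and leaf plot (assumed reading of the leaves)
school_a = [24, 25, 25, 27, 27, 29, 30, 31, 31, 32, 36, 37, 38, 57]
school_b = [22, 33, 34, 36, 37, 40, 41, 43, 44, 45, 47, 51, 53, 54]

summary_a = boxplot_summary(school_a)
summary_b = boxplot_summary(school_b)
print("School A:", summary_a)
print("School B:", summary_b)
```

With this reading of the plot, the sketch returns the same quartiles as the example and flags 57 as the only outlier.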
Information Obtained from a Boxplot
1. If the median is near the centre of the box, the distribution is approximately
symmetric.
2. If the median falls to the left of the centre of the box, the distribution is positively
skewed.
3. If the median falls to the right of the centre of the box, the distribution is
negatively skewed.
Suppose the median is near the centre of the box (approximately symmetric):
4. If the lines are about the same length, the distribution is approximately
symmetric.
5. If the right line is longer than the left line, the distribution is positively skewed.
6. If the left line is longer than the right line, the distribution is negatively skewed.
If the boxplots for two or more data sets are graphed on the same axis, the
distributions can be compared using their central tendency (average) and
variability values.
To compare the average, use the location of the medians.
To compare the variability, use the length of the IQR.
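The six guidelines above can be written as a small decision function. This Python sketch is a simplification: it treats "near the centre" as exactly at the centre, which is an assumption, and the function name `boxplot_shape` is hypothetical. The two calls use the five-number summaries of School A and School B from Example 1.12.

```python
def boxplot_shape(minimum, q1, median, q3, maximum):
    """Classify the shape of a distribution from its five-number summary."""
    centre = (q1 + q3) / 2
    # Guidelines 2 and 3: position of the median inside the box
    if median < centre:
        return "positively skewed"
    if median > centre:
        return "negatively skewed"
    # Guidelines 4 to 6: median at the centre, so compare the two lines
    left_line, right_line = q1 - minimum, maximum - q3
    if right_line > left_line:
        return "positively skewed"
    if left_line > right_line:
        return "negatively skewed"
    return "approximately symmetric"

shape_a = boxplot_shape(24, 27, 30.5, 36, 38)   # School A
shape_b = boxplot_shape(22, 36, 42, 47, 54)     # School B
iqr_a, iqr_b = 36 - 27, 47 - 36                 # box lengths for variability

print("School A:", shape_a, "| IQR =", iqr_a)
print("School B:", shape_b, "| IQR =", iqr_b)
```

In practice a tolerance would be needed to decide when the median is "near" the centre; the slides do not specify one, so none is assumed here.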
EXAMPLE 1.12 solution
Shape:
Based on the location of the median, School A has a right-skewed distribution where most of
the teachers’ ages are concentrated at the lower end (< 30 years old). However, School B has a
left-skewed distribution where most of the teachers’ ages are greater than 42 years old.
Average:
Based on the median, 50% of the teachers at School A are younger than 30.5 years,
whereas 50% of the teachers at School B are younger than 42 years. On average, teachers at
School B are older than the teachers at School A.
Variability:
Based on the IQR, for School A, $IQR_A = 9$ years, where the middle 50% of the teachers
are aged between 27 and 36 years. Meanwhile, for School B, $IQR_B = 11$ years, where the middle
50% of the teachers are aged between 36 and 47 years. Hence, the variation of teachers’ ages at
School B is higher than at School A ($IQR_A < IQR_B$).
Range:
Excluding the outlier, teachers’ ages at School A vary less, from a minimum of 24 years to a
maximum of 38 years, compared to School B with a minimum of 22 years and a
maximum of 54 years.
Boxplot for Special Case
In some cases, we cannot use the general guideline as given above to interpret the
boxplot.
Boxplot is not the best graphical representation to describe a data set if the sample
size of the data set is too small.
The existence of outliers also may affect the boxplot.
Therefore, in such cases, we have to use the descriptive statistics to identify the
distribution of the data set.
EXERCISE 1.4.2 (Q1)
1. Plot a boxplot for the following data. Then describe the data.
a) 3.2, 5.9, 4.3, 6.9, 4.5, 8.0, 4.7, 8.9, 5.7, 11.9
1.4.2 (Q1) solution
EXERCISE 1.4.2(Q2)
2. Two samples of ten springs made out of the steel rods supplied by
two different companies were compared. The measurement of
flexibility (in N/m) for each spring was recorded as follows. Compare
the distributions using box-plots.
Company A: Min = 6.7, Q1 = 7.3, Q2 = 8.25, Q3 = 8.8, 4.2 is an outlier, Max = 9.3, left-skewed
Company B: Min = 9.6, Q1 = 9.8, Q2 = 10.15, Q3 = 11.0, no outlier, Max = 16.4, right-skewed
1.4.2 (Q2) solution
EXERCISE 1.4.2 (Q3)
3. The following Table presents viscosity (in Pascal) of chemical substance from
three (3) batches of chemical process.
Batches Viscosity
Batch A 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9
b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape and variability.
Batch A: Q3 = 15.2, right-skewed; Batch B: Q2 = 15.05, no outlier, left-skewed; Batch C: Q1 = 14.1, right-skewed
1.4.2 (Q3) solution
[Figure: boxplots of viscosity for Batch A, Batch B and Batch C; vertical axis from 12 to 17]
MIND EXPANDING EXERCISES
ME.15
MIND EXPANDING EXERCISES
15. An experiment was conducted to assess the potency of various constituents of
orchard sprays in repelling honeybees. Individual cells of dry comb were filled
with measured amounts of lime sulphur emulsion in sucrose solution. Seven
different concentrations of lime sulphur ranging from a concentration of 1/100
to 1/1,562,500 in successive factors of 1/5 were used as well as a solution
containing no lime sulphur (A, B, C, D, E, F, G, H). The responses for the
different solutions were obtained by releasing 100 bees into the chamber for
two hours, and then measuring the decrease in volume of the solutions in the
various cells. Based on the figure below, answer the following questions:
a) Which concentration has outlier(s)?
b) Group the concentration according to their shape of distribution.
c) Which concentration has the most consistent data? Why?
d) Which concentration has the most variable data? Why?
e) H is the concentration of ‘no lime sulphur’. What is the use of
concentration H?
f) What conclusion can you draw from this experiment?
1.5 NORMAL
PROBABILITY PLOT
Normal Probability Plots
The easiest way to check whether the sample distribution is normal or not.
The most plausible normal distribution is the one whose mean and standard deviation
are the same as the sample mean and standard deviation.
STEP 1 : Sort the data in ascending order and denote each sorted value as $x_i$, $i = 1, \ldots, n$.
STEP 2 : Number the sorted data from 1 to $n$.
STEP 3 : Calculate the probability value for each $x_i$ using $p_i = \dfrac{i - 0.5}{n}$.
STEP 4 : Plot $p_i$ versus $x_i$.
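The four steps can be sketched in Python using the six values from Exercise 1.5. The function name `plotting_positions` is hypothetical. The z-scores at the end are an optional extra that is not one of the four steps: they convert each $p_i$ to a standard normal quantile, which is one common way of judging whether the points follow a straight line.

```python
from statistics import NormalDist

def plotting_positions(data):
    """Steps 1 to 3: sort the data and pair each x_i with p_i = (i - 0.5)/n."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]

sample = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]   # data from Exercise 1.5
points = plotting_positions(sample)

# Step 4 would plot p_i against x_i; here the pairs are simply listed
for x, p in points:
    print(f"x = {x:5.2f}   p = {p:.4f}")

# Optional extra (assumption, not in the steps above): normal quantiles
z_scores = [NormalDist().inv_cdf(p) for _, p in points]
print([round(z, 3) for z in z_scores])
```

The pairs can then be passed to any plotting tool. Whether the points look close enough to a straight line remains a visual judgement.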
Testing Normality using
Software
Other than plotting manually, we can obtain the plot from software such as SPSS,
Minitab and Excel. The normality of the data can also be tested by
using the Kolmogorov-Smirnov, Anderson-Darling or Shapiro-Wilk tests.
EXAMPLE 1.13
EXERCISE 1.5
1. A sample of size six is drawn. The sample, arranged in
increasing order, is
3.01 3.35 4.79 5.96 7.89 9.15
Do these data appear to come from an approximately normal
distribution?
1) yes 2) no
1.5 (Q1) solution
[Figure: normal probability plot; horizontal axis from 0 to 10, vertical axis from 0 to 1]
1.5 (Q2) solution
[Figure: normal probability plot of $p_i$ (vertical axis, 0 to 1.2) versus $x_i$ (horizontal axis, 0 to 2500)]
CONCLUSION
• The applications of statistics are
many and varied. People
encounter them in everyday life,
such as in reading newspapers or
magazines, listening to the radio,
or watching television.
Thank You
NEXT: Chapter 2 Sampling Distribution and Confidence Interval