Desc. Stat
Descriptive Statistics
TEXTBOOKS (REQUIRED MATERIALS)
1. Statistics for Business & Economics by David R. Anderson; Dennis J. Sweeney; Thomas A. Williams; Jeffrey
D. Camm; James J. Cochran. Cengage Learning
Additional References
• Aczel, A. D., & Sounderpandian, J. (1999). Complete business statistics. Boston, MA: Irwin/McGraw Hill.
• Business Statistics for Contemporary Decision Making. Ken Black. Wiley India.
• Lecture Notes (Notes will be distributed each week by the faculty and/or shared through google classroom.)
Structured and Unstructured Data
• Structured data is data arranged in matrix form, with labelled rows and columns.
• Any data that is not originally in matrix form with rows and columns is unstructured data.
Structured data consisting of nominal and ratio scales
No.  Gender  Age  Percentage SSC  Board SSC  Percentage HSC  Percentage Degree  Salary
1    M       23   62              Others     88              52                 270000
4    M       22   60              CBSE       63              58                 250000
5    M       22   61              CBSE       55              54                 180000
6    M       23   55              ICSE       64              50                 300000
7    F       24   70              Others     54              65                 240000
10   F       23   59              CBSE       74              59                 240000
Data Type
$$\text{Mean} = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \sum_{i=1}^{n} \frac{x_i}{n}$$
Mean
• The symbol $\bar{X}$ is frequently used to represent the estimated value of the mean from a sample.
• If the entire population is available and we calculate the mean from the entire population, we have the population mean, which is denoted by $\mu$.
• Using the salaries in the table, the average salary is given by
$$\bar{X} = \frac{(270 + 220 + 240 + 250 + 180 + 300 + 240 + 235 + 425 + 240) \times 1000}{10} = 260000$$
Property of Mean
An important property of the mean is that the deviations of the observations from the mean sum to zero, that is,
$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0$$
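The mean calculation and the zero-sum property can be checked numerically. The following is a minimal sketch that uses the ten salaries (in thousands) from the worked example; the function and variable names are illustrative only.

```python
# Salaries in thousands, taken from the worked example in the text.
salaries_k = [270, 220, 240, 250, 180, 300, 240, 235, 425, 240]


def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_k = mean(salaries_k)
mean_salary = mean_k * 1000  # convert from thousands back to currency units

# Deviation of each observation from the mean.
deviations = [x - mean_k for x in salaries_k]

print(mean_salary)       # 260000.0
print(sum(deviations))   # 0.0, the deviations cancel out
```

With real-valued data the sum of deviations may differ from zero by a tiny floating-point error, so a tolerance check is safer than testing for exact equality.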
Median (or Mid) Value
• The median is the value that divides the data into two equal parts: 50% of the observations lie below the median and 50% lie above it.
• The easiest way to find the median is to arrange the data in increasing order. When n is odd, the median is the value at position (n + 1)/2. When n is even, the median is the average of the (n/2)th and ((n + 2)/2)th observations.
• Ex: The number of deposits in a branch of a bank in a week is

Day                 1    2    3    4    5    6    7
Number of deposits  245  326  180  226  445  319  260

• In ascending order, the data are 180, 226, 245, 260, 319, 326 and 445.
• Now (n + 1)/2 = 8/2 = 4. Thus the median is the 4th value after arranging the data in increasing order; in this case it is 260.
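The odd and even cases of the median rule can be sketched in a few lines of Python. This is an illustrative implementation of the rule as stated above, applied to the bank deposit data; the function name is an assumption.

```python
def median(values):
    """Median using the position rule: (n + 1)/2 for odd n,
    average of the (n/2)th and ((n + 2)/2)th values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Position (n + 1)/2 is 1-based, so subtract 1 for the list index.
        return ordered[(n + 1) // 2 - 1]
    lower = ordered[n // 2 - 1]   # (n/2)th observation
    upper = ordered[n // 2]       # ((n + 2)/2)th observation
    return (lower + upper) / 2


deposits = [245, 326, 180, 226, 445, 319, 260]
print(median(deposits))  # 260
```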
Mode
• The mode is the most frequently occurring value in the dataset.
• The mode is the only measure of central tendency that is valid for qualitative (nominal) data, since the mean and median of nominal data are meaningless.
• For example, assume that a retailer's customer data records the marital status of each customer as (a) Married, (b) Unmarried, (c) Divorced Male, or (d) Divorced Female. The mean and median are meaningless for qualitative data such as marital status. The mode, on the other hand, captures the marital status that occurs most frequently in the database.
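As a small illustration of finding the mode of nominal data, the sketch below counts category labels. The marital-status records are hypothetical values made up for this example; they do not come from any dataset in these notes.

```python
from collections import Counter

# Hypothetical records, invented purely for illustration.
marital_status = [
    "Married", "Unmarried", "Married", "Divorced Male",
    "Married", "Unmarried", "Divorced Female",
]


def modes(values):
    """Return every value that attains the highest frequency.
    A list is returned because ties are possible."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes(marital_status))  # ['Married']
```

Returning a list rather than a single value makes ties visible instead of silently picking one of the tied categories.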
Measures of Variation
• Predictive analytics techniques such as regression attempt to explain variation
in the outcome variable (Y) using predictor variables (X)
• Variability in the data is measured using the following measures:
• Range
• Inter-Quartile Distance (IQD)
• Variance
• Standard Deviation
Sample Variance
• For a sample, the sample variance ($S^2$) is calculated using
$$S^2 = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}$$
• While calculating the sample variance $S^2$, the sum of squared deviations $\sum_{i=1}^{n} (X_i - \bar{X})^2$ is divided by (n − 1); this is known as Bessel's correction.
Range, IQD and Variance
• Range is the difference between maximum and minimum value of the
data. It captures the data spread.
• Inter-quartile distance (IQD), also called inter-quartile range (IQR) is a
measure of the distance between Quartile 1 (Q1) and Quartile 3 (Q3)
• Variance is a measure of variability in the data from the mean value. The population variance, $\sigma^2$, is calculated using
$$\text{Variance} = \sigma^2 = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}$$
Standard Deviation
The population standard deviation ($\sigma$) and sample standard deviation ($S$) are given by
$$\sigma = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}}, \qquad S = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}}$$
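The difference between the population and sample formulas is only the divisor (n versus n − 1). The sketch below computes the range, both variances and both standard deviations for the bank deposit data from the median example; treating those seven values first as a population and then as a sample is an assumption made purely to show both formulas.

```python
import math

deposits = [245, 326, 180, 226, 445, 319, 260]


def sum_squared_deviations(values):
    """Sum of squared deviations of the observations from their mean."""
    centre = sum(values) / len(values)
    return sum((x - centre) ** 2 for x in values)


def population_variance(values):
    # Divide by n when the data are the entire population.
    return sum_squared_deviations(values) / len(values)


def sample_variance(values):
    # Divide by n - 1 (Bessel's correction) when the data are a sample.
    return sum_squared_deviations(values) / (len(values) - 1)


data_range = max(deposits) - min(deposits)
sigma = math.sqrt(population_variance(deposits))
s = math.sqrt(sample_variance(deposits))

print(data_range)  # 265
print(sigma, s)    # s is slightly larger than sigma
```

Python's standard `statistics` module offers `pvariance`, `variance`, `pstdev` and `stdev`, which follow the same two conventions.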
Degrees of Freedom
Histogram
• Histogram is the visual representation of the data which can be used to assess the
probability distribution (frequency distribution) of the data
Here Xmax and Xmin are the maximum and minimum values of the data and W is the desired width of the bin (interval). Intervals in histograms are usually of equal size.
Step 2: Count the number of observations from the data that fall under each bin (interval).
Step 3: Create a frequency distribution (bin in the horizontal axis and frequency in the
vertical axis) using the information obtained in steps 1 and 2
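The binning and counting steps can be sketched as follows. Step 1 is not fully shown in these notes, so this sketch assumes the number of bins is (Xmax − Xmin)/W rounded up, that bins start at Xmin, and that the maximum value is placed in the last bin. The data are the bank deposits from the median example and the bin width W = 100 is an arbitrary choice for illustration.

```python
import math


def histogram_counts(values, width):
    """Return (bin_edges, counts) for equal-width bins starting at the minimum."""
    x_min, x_max = min(values), max(values)
    # Assumed rule: number of bins = (Xmax - Xmin) / W, rounded up (at least 1).
    n_bins = max(1, math.ceil((x_max - x_min) / width))
    counts = [0] * n_bins
    for x in values:
        # Index of the bin containing x; clamp so the maximum lands in the last bin.
        k = min(int((x - x_min) // width), n_bins - 1)
        counts[k] += 1
    edges = [x_min + k * width for k in range(n_bins + 1)]
    return edges, counts


deposits = [245, 326, 180, 226, 445, 319, 260]
edges, counts = histogram_counts(deposits, width=100)
print(edges)   # [180, 280, 380, 480]
print(counts)  # [4, 2, 1]
```

Plotting `counts` against the bins (for example with a bar chart) gives the histogram of Step 3.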
Use of Histogram
• The following formula is usually used for a sample with n observations (Joanes and Gill, 1998):
$$G_1 = \frac{\sqrt{n(n - 1)}}{n - 2}\, g_1$$
Kurtosis
• Kurtosis is another measure of shape, aimed at shape of the tail, that is,
whether the tail of the data distribution is heavy or light. Kurtosis is
measured using the following equation:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4}$$
$$\text{Excess Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4} - 3$$
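The shape measures can be computed from central moments. The sketch below rests on two assumptions that are not stated in this section: g1 is taken to be the moment-based skewness m3 / m2^(3/2), where m_k is the k-th central moment, and σ in the kurtosis formula is taken to be the population standard deviation (so σ^4 = m2^2). The deposit data from the median example is reused for illustration.

```python
import math


def central_moment(values, order):
    """k-th central moment: average of (x - mean)^k over the data."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** order for x in values) / n


def skewness_g1(values):
    # Assumed definition of g1: m3 / m2^(3/2).
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def skewness_G1(values):
    # Sample adjustment: G1 = sqrt(n(n - 1)) / (n - 2) * g1, requires n > 2.
    n = len(values)
    return math.sqrt(n * (n - 1)) / (n - 2) * skewness_g1(values)


def kurtosis(values):
    # Fourth central moment divided by sigma^4, with sigma^2 = m2.
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def excess_kurtosis(values):
    return kurtosis(values) - 3


deposits = [245, 326, 180, 226, 445, 319, 260]
print(skewness_G1(deposits))      # positive: the long tail is on the right
print(excess_kurtosis(deposits))
```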
Chebyshev’s Theorem
$$P(\mu - k\sigma \le X \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
• Ex: The amount spent per month by a segment of credit card users of a bank has a mean of 12000 and a standard deviation of 2000. Calculate the proportion of customers who spend between 8000 and 16000.
• Solution:
$$P(8000 \le X \le 16000) = P(\mu - 2\sigma \le X \le \mu + 2\sigma) \ge 1 - \frac{1}{2^2} = 0.75$$
That is, the proportion of customers spending between 8000 and 16000 is at least 0.75 (or 75%)
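The same bound can be computed for any interval that is symmetric about the mean. The sketch below is illustrative; the function name and the choice to return 0 when k ≤ 1 (where the bound carries no information) are assumptions.

```python
import math


def chebyshev_lower_bound(mean, std, low, high):
    """Lower bound on P(low <= X <= high) for an interval symmetric about the mean."""
    if not math.isclose(mean - low, high - mean):
        raise ValueError("interval must be symmetric about the mean")
    k = (high - mean) / std  # half-width of the interval in standard deviations
    if k <= 1:
        return 0.0  # 1 - 1/k^2 is not positive here, so the bound is uninformative
    return 1 - 1 / k ** 2


# Credit card example: mean 12000, standard deviation 2000, interval 8000 to 16000.
print(chebyshev_lower_bound(12000, 2000, 8000, 16000))  # 0.75
```

The bound holds for any distribution with a finite mean and variance, which is why the answer is stated as "at least" 75% rather than an exact proportion.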
Example (Percentile Calculation)
Time between failures of wire-cut (in hours)
2 22 32 39 46 56 76 79 88 93
3 24 33 44 46 66 77 79 89 99
5 24 34 45 47 67 77 86 89 99
9 26 37 45 55 67 78 86 89 99
21 31 39 46 56 75 78 87 90 102
1. Calculate the mean, median, and mode of time between failures of wire-cuts
2. The company would like to know by what time 10% (10th percentile or P10) and
90% (90th percentile or P90) of the wire-cuts will fail.
3. Calculate the values of P25 and P75.
Solution
1. Mean = 57.64, median = 56, and mode = 46 (in the data as listed, 89 and 99 also occur three times each, so the mode is not unique)
2. Instead of rounding the value obtained from Eq, we can use the following approximation. With n = 50 observations, the position of P10 is 10 × (n + 1)/100 = 10 × 51/100 = 5.1
The value at the 5th position is 21. The value at position 5.1 is approximated as 21 + 0.1 × (value at 6th position − value at 5th position) = 21 + 0.1 × (1) = 21.1
That is, 10% of the wire-cuts will fail by 21.1 hours
The position of P90 is 90 × 51/100 = 45.9
The value at position 45 is 90 and the value at position 45.9 is 90 + 0.9 × (3) = 92.7
That is, 90% of the wire-cuts will fail by 92.7 hours
3. The position of P25 (1st Quartile or Q1) is 25 × 51/100 = 12.75. The value at the 12th position is 33, so
P25 = 33 + 0.75 × (value at 13th position − value at 12th position) = 33 + 0.75 × (1) = 33.75
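The interpolation used in this solution can be written as a short function. The sketch below implements the position rule p × (n + 1)/100 with linear interpolation between neighbouring ordered values; clamping positions that fall outside 1..n to the smallest or largest value is an assumption added to keep the function safe for extreme percentiles.

```python
from statistics import mean, median, multimode

# Time between failures of wire-cuts (in hours), as listed in the example.
times = [
    2, 22, 32, 39, 46, 56, 76, 79, 88, 93,
    3, 24, 33, 44, 46, 66, 77, 79, 89, 99,
    5, 24, 34, 45, 47, 67, 77, 86, 89, 99,
    9, 26, 37, 45, 55, 67, 78, 86, 89, 99,
    21, 31, 39, 46, 56, 75, 78, 87, 90, 102,
]


def percentile(values, p):
    """p-th percentile using position p(n + 1)/100 and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100      # 1-based and possibly fractional
    lower = int(position)             # whole part of the position
    fraction = position - lower       # fractional part used for interpolation
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


print(mean(times), median(times))   # 57.64 56.0
print(multimode(times))             # every value tied for the highest count
for p in (10, 25, 75, 90):
    print(p, round(percentile(times, p), 2))
```

Other software may use a different position rule (for example p × (n − 1)/100 + 1), so percentile values from libraries can differ slightly from the hand calculation.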
• Pie chart is mainly used for categorical data and is a circular chart that
displays the proportion of each category in the dataset
Pie chart for movie genre
Scatter Plot
• A scatter plot is a plot of two variables that helps data scientists understand whether there is any relationship between them
• A scatter plot is also useful for assessing the strength of the relationship and for finding outliers in the data
Scatter plot between movie budget and box office collection
Box Plot (or Box and Whisker Plot)
• The box plot is constructed using IQR, minimum and maximum values
Bollywood movie Budget Boxplot
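The quantities a box plot is built from can be computed without any plotting library. The movie budget data behind the figure is not included in these notes, so the sketch below reuses the wire-cut failure times from the percentile example. It relies on `statistics.quantiles` from the Python standard library (Python 3.8+), whose default "exclusive" method uses (n + 1)-based positions.

```python
from statistics import quantiles

# Wire-cut failure times (in hours) from the percentile example.
times = [
    2, 22, 32, 39, 46, 56, 76, 79, 88, 93,
    3, 24, 33, 44, 46, 66, 77, 79, 89, 99,
    5, 24, 34, 45, 47, 67, 77, 86, 89, 99,
    9, 26, 37, 45, 55, 67, 78, 86, 89, 99,
    21, 31, 39, 46, 56, 75, 78, 87, 90, 102,
]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(times, n=4)
iqr = q3 - q1  # inter-quartile range: the height of the box

summary = {
    "min": min(times),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(times),
    "IQR": iqr,
}
print(summary)
```

The box spans Q1 to Q3 with a line at the median; how far the whiskers extend depends on the plotting convention in use.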