
Desc. Stat


Business Statistics

Descriptive Statistics
TEXTBOOKS (REQUIRED MATERIALS)

1. Statistics for Business & Economics by David R. Anderson; Dennis J. Sweeney; Thomas A. Williams; Jeffrey
D. Camm; James J. Cochran. Cengage Learning

Additional References
• Aczel, A. D., & Sounderpandian, J. (1999). Complete business statistics. Boston, MA: Irwin/McGraw Hill.

• Business Statistics for Contemporary Decision Making. Ken Black. Wiley India.

• Statistics for Management. Richard I. Levin & David S. Rubin. Pearson.

• Lecture Notes (Notes will be distributed each week by the faculty and/or shared through google classroom.)
Structured and Unstructured Data
• Structured data is data arranged in matrix form, with labelled rows and columns.

• Any data that is not originally in matrix form, with rows and columns, is unstructured data.
Structured data consisting of nominal and ratio scales

No.  Gender  Age  Percentage SSC  Board SSC  Percentage HSC  Percentage Degree  Salary
1    M       23   62              Others     88              52                 270000
2    M       21   76.33           ICSE       75.33           75.48              220000
3    M       22   72              Others     78              66.63              240000
4    M       22   60              CBSE       63              58                 250000
5    M       22   61              CBSE       55              54                 180000
6    M       23   55              ICSE       64              50                 300000
7    F       24   70              Others     54              65                 240000
8    M       22   68              ICSE       77              72.5               235000
9    M       24   82.8            CBSE       70.6            69.3               425000
10   F       23   59              CBSE       74              59                 240000
Data Type

• Cross-Sectional Data: Data collected on many variables of interest at the same time or over the same period is called cross-sectional data.

• Time Series Data: Data collected for a single variable, such as demand for smartphones, over several time intervals (weekly, monthly, etc.) is called time series data.

• Panel Data: Data collected on several variables (multiple dimensions) over several time intervals is called panel data (also known as longitudinal data).
TYPES OF DATA MEASUREMENT SCALES

• Nominal scale refers to variables that are basically names (qualitative data); such variables are also known as categorical variables.
• Ordinal scale refers to a variable whose values come from an ordered set, so the values can be ranked by magnitude.
• Interval scale corresponds to a variable whose value is chosen from an interval set. Variables such as temperature (measured in degrees centigrade) and intelligence quotient (IQ) score are examples of interval scale.
• Any variable for which ratios can be computed and are meaningful is called ratio scale.
Population And Sample

• Population is the set of all possible observations (often called cases, records, subjects or data points) for a given context of the problem.

• Sample is a subset taken from a population.


Measures Of Central Tendency

• Mean (or Average) Value

Mean is the arithmetic average of the data and is one of the most frequently used measures of central tendency.

$$\text{Mean} = \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \sum_{i=1}^{n} \frac{X_i}{n}$$
Mean

• The symbol $\bar{X}$ is frequently used to represent the estimated value of the mean from a sample.
• If the entire population is available and the mean is calculated from the entire population, we have the population mean, denoted by $\mu$.
• For the salary column of the structured-data table, the average salary is given by

$$\bar{X} = \frac{(270 + 220 + 240 + 250 + 180 + 300 + 240 + 235 + 425 + 240) \times 1000}{10} = 260000$$

Property of Mean
An important property of the mean is that the sum of the deviations of the observations from the mean is zero, that is,

$$\sum_{i=1}^{n} \left( X_i - \bar{X} \right) = 0$$
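
As a quick check of the salary calculation and the zero-sum property, here is a minimal Python sketch. The salary values are copied from the structured-data table; the variable names are illustrative and not part of the course material.

```python
# Salaries from the structured-data table (10 students).
salaries = [270000, 220000, 240000, 250000, 180000,
            300000, 240000, 235000, 425000, 240000]

n = len(salaries)
mean_salary = sum(salaries) / n  # arithmetic mean

# Deviations of each observation from the mean; their sum should be zero.
deviations = [x - mean_salary for x in salaries]
sum_of_deviations = sum(deviations)

print(mean_salary)        # 260000.0
print(sum_of_deviations)  # 0.0
```

With data that does not divide evenly, the sum of deviations may differ from zero by a tiny floating-point error.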
Median (or Mid) Value

• Median is the value that divides the data into two equal parts, that is, 50% of the observations lie below the median and 50% lie above it.
• The easiest way to find the median is to arrange the data in increasing order. When n is odd, the median is the value at position (n + 1)/2. When n is even, the median is the average of the (n/2)th and ((n + 2)/2)th observations.

• Ex: The number of deposits in a branch of a bank in a week is

Day                  1    2    3    4    5    6    7
Number of deposits   245  326  180  226  445  319  260

• The data in ascending order is 180, 226, 245, 260, 319, 326 and 445.
• Here n = 7, so (n + 1)/2 = 8/2 = 4. Thus the median is the 4th value after arranging the data in increasing order; in this case it is 260.
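
The position rule for odd and even n can be sketched in Python as follows. The function name `median` is only illustrative; the deposit figures are the ones from the example, and the six-day case is an added illustration of the even-n rule.

```python
def median(values):
    """Median by the position rule: the value at position (n + 1)/2 for odd n,
    the average of the (n/2)th and (n/2 + 1)th sorted values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # 1-based position (n + 1)/2 is zero-based index n // 2
    return (ordered[mid - 1] + ordered[mid]) / 2

deposits = [245, 326, 180, 226, 445, 319, 260]
print(median(deposits))      # 260
print(median(deposits[:6]))  # first six days only (even n): 282.0
```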
Mode
• Mode is the most frequently occurring value in the dataset

• Mode is the only measure of central tendency which is valid for qualitative (nominal) data
since the mean and median for nominal data are meaningless.

• For example, assume that a retailer's customer data records the marital status of each customer as (a) Married, (b) Unmarried, (c) Divorced Male, or (d) Divorced Female. Mean and median are meaningless for qualitative data such as marital status. The mode, on the other hand, captures the marital status that occurs most frequently in the database.
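
The same idea applies to any nominal column. The retailer data is not given in these notes, so the minimal sketch below uses the Board SSC column of the structured-data table instead; the variable names are illustrative.

```python
from statistics import mode

# "Board SSC" column from the structured-data table (nominal data), rows 1 to 10.
board_ssc = ["Others", "ICSE", "Others", "CBSE", "CBSE",
             "ICSE", "Others", "ICSE", "CBSE", "CBSE"]

# A mean or median of these labels has no meaning, but the mode does.
most_common_board = mode(board_ssc)
print(most_common_board)  # CBSE
```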
Measures of Variation
• Predictive analytics techniques such as regression attempt to explain variation
in the outcome variable (Y) using predictor variables (X)
• Variability in the data is measured using the following measures:
• Range
• Inter-Quartile Distance (IQD)
• Variance
• Standard Deviation
Sample Variance
• In case of a sample, the sample variance ($S^2$) is calculated using

$$S^2 = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}$$

• While calculating the sample variance $S^2$, the sum of squared deviations $\sum_{i=1}^{n} (X_i - \bar{X})^2$ is divided by (n − 1); this is known as Bessel's correction.
Range, IQD and Variance
• Range is the difference between maximum and minimum value of the
data. It captures the data spread.
• Inter-quartile distance (IQD), also called inter-quartile range (IQR), is a measure of the distance between Quartile 1 (Q1) and Quartile 3 (Q3).
• Variance is a measure of variability in the data from the mean value. Variance for a population, $\sigma^2$, is calculated using

$$\text{Variance} = \sigma^2 = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}$$

Standard Deviation
The population standard deviation ($\sigma$) and sample standard deviation ($S$) are given by

$$\sigma = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}} \qquad S = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}}$$
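
To make the difference between dividing by n and by (n − 1) concrete, here is a minimal Python sketch on the salary column of the structured-data table. Treating the same ten values both as a population and as a sample is an assumption made purely for illustration.

```python
import math

# Salaries from the structured-data table.
salaries = [270000, 220000, 240000, 250000, 180000,
            300000, 240000, 235000, 425000, 240000]

n = len(salaries)
mean_salary = sum(salaries) / n

data_range = max(salaries) - min(salaries)
sum_sq_dev = sum((x - mean_salary) ** 2 for x in salaries)

population_variance = sum_sq_dev / n    # divide by n
sample_variance = sum_sq_dev / (n - 1)  # divide by n - 1 (Bessel's correction)
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(data_range)           # 245000
print(population_variance)  # 3885000000.0
print(round(population_sd, 2), round(sample_sd, 2))
```

The sample standard deviation is always slightly larger than the population value computed from the same numbers, because the divisor is smaller.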
Degrees of Freedom

• Degrees of freedom is equal to the number of independent variables in the model (Trochim, 2005).

• For example, we can create a sample of size n with a given mean value $\bar{X}$ by freely selecting (n − 1) values; the one remaining value is then fixed by the mean. Thus the number of independent variables in this case is (n − 1).

• Degrees of freedom is defined as the difference between the number of observations in the sample and the number of parameters estimated (Walker, 1940; Toothaker and Miller, 1996). If there are n observations in the sample and k parameters are estimated from the sample, then the degrees of freedom is (n − k).
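
The (n − 1) argument can be illustrated with a short Python sketch: once n − 1 values are chosen freely, the last value is forced by the target mean. The helper name `complete_sample`, the seed, and the target mean of 50 are hypothetical choices for this illustration.

```python
import random

def complete_sample(free_values, target_mean):
    """Given n - 1 freely chosen values, return the single remaining value
    that forces the mean of all n values to equal target_mean."""
    n = len(free_values) + 1
    return n * target_mean - sum(free_values)

random.seed(0)  # arbitrary seed, used only for reproducibility
free_values = [random.randint(0, 100) for _ in range(4)]  # n - 1 = 4 free choices
last_value = complete_sample(free_values, target_mean=50)
sample = free_values + [last_value]

print(sum(sample) / len(sample))  # 50.0
```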
Data Visualization
• Many useful charts, such as the histogram, bar chart, pie chart and box plot, assist data scientists in visualizing the data.

Histogram
• A histogram is a visual representation of the data that can be used to assess the probability distribution (frequency distribution) of the data.

• Histograms are created for continuous (numerical) data.

• It is a frequency distribution of data arranged in consecutive and non-overlapping intervals.
Steps to construct histograms
Step 1: Divide the data into a finite number of non-overlapping and consecutive bins (intervals).

$$\text{Number of bins, } N = \frac{X_{\max} - X_{\min}}{W}$$

Here $X_{\max}$ and $X_{\min}$ are the maximum and minimum values of the data and W is the desired width of the bin (interval). Intervals in histograms are usually of equal size.

Step 2: Count the number of observations from the data that fall under each bin (interval).

Step 3: Create a frequency distribution (bins on the horizontal axis and frequency on the vertical axis) using the information obtained in steps 1 and 2.
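
The three steps can be sketched in Python as below. This is only one way to do it: rounding N up to a whole number of bins and letting the last bin include its upper edge are assumptions not stated in the steps above, and `histogram_counts` is a hypothetical helper name. The bank-deposit data from the median example is reused because the movie data is not listed in these notes.

```python
import math

def histogram_counts(data, bin_width):
    """Count observations in equal-width, consecutive, non-overlapping bins.

    Each bin covers [lower, lower + W); the last bin also includes its upper
    edge so that the maximum value is always counted.
    """
    x_min, x_max = min(data), max(data)
    # Step 1: number of bins N = (Xmax - Xmin) / W, rounded up (assumption).
    n_bins = max(1, math.ceil((x_max - x_min) / bin_width))
    counts = [0] * n_bins
    # Step 2: count the observations that fall in each bin.
    for x in data:
        index = min(int((x - x_min) // bin_width), n_bins - 1)
        counts[index] += 1
    edges = [x_min + i * bin_width for i in range(n_bins + 1)]
    return edges, counts

deposits = [245, 326, 180, 226, 445, 319, 260]
edges, counts = histogram_counts(deposits, bin_width=100)

# Step 3: bins on one axis, frequency on the other (drawn here as text).
for lower, upper, count in zip(edges, edges[1:], counts):
    print(f"{lower}-{upper}: {'#' * count}")
```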
Use of Histogram

A histogram is very useful since it helps data scientists identify the following:
• The shape of the distribution, which helps assess the probability distribution of the data.
• Measures of central tendency such as median and mode.
• Measures of variability such as spread.
• Measures of shape such as skewness.
[Figure: Histogram of Bollywood movie budget]
[Figure: Histogram of Bollywood movie box-office collection]
[Figure: Histogram of Bollywood movie budget along with normal distribution frequency]
Skewness

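• The following formula is usually used for a sample with n observations (Joanes and Gill, 1998):

$$G_1 = \frac{\sqrt{n(n - 1)}}{n - 2}\, g_1$$

where $g_1 = m_3 / m_2^{3/2}$ is the moment-based sample skewness and $m_r = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^r$ is the r-th central moment of the sample.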
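
A minimal Python sketch of this adjustment follows. It assumes that g1 is the moment-based skewness m3 / m2^(3/2), as in Joanes and Gill's notation; the function name `sample_skewness` is illustrative, and the salary data comes from the structured-data table.

```python
import math

def sample_skewness(data):
    """Return (g1, G1): moment-based skewness and its sample-adjusted version."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    g1 = m3 / m2 ** 1.5                          # assumes the data is not constant
    G1 = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    return g1, G1

salaries = [270000, 220000, 240000, 250000, 180000,
            300000, 240000, 235000, 425000, 240000]
g1, G1 = sample_skewness(salaries)
print(g1 > 0)  # True: the single high salary gives a long right tail
print(round(g1, 3), round(G1, 3))
```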
Kurtosis
• Kurtosis is another measure of shape, aimed at the shape of the tail, that is, whether the tail of the data distribution is heavy or light. Kurtosis is measured using the following equation:

$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4}$$

• A distribution with a kurtosis value of less than 3 is called platykurtic and one with a value greater than 3 is called leptokurtic.
• A kurtosis value of 3 matches that of a normal distribution; such a distribution is called mesokurtic.
[Figure: Leptokurtic, mesokurtic, and platykurtic distributions]
Excess Kurtosis

• The excess kurtosis is a measure that captures deviation from the kurtosis of a normal distribution and is given by:

$$\text{Excess Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4} - 3$$
Chebyshev’s Theorem

• Chebyshev's theorem (also known as Chebyshev's inequality) gives a lower bound on the proportion of observations that lie within an interval defined using the mean and standard deviation. It holds for any distribution with a finite mean and standard deviation. The probability of finding a randomly selected value in the interval $\mu \pm k\sigma$ is at least $1 - 1/k^2$, that is,

$$P(\mu - k\sigma \le X \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$

• Ex: The amount spent per month by a segment of credit card users of a bank has a mean value of 12000 and a standard deviation of 2000. Calculate the proportion of customers who spend between 8000 and 16000.
• Solution:

$$P(8000 \le X \le 16000) = P(\mu - 2\sigma \le X \le \mu + 2\sigma) \ge 1 - \frac{1}{2^2} = 0.75$$

That is, the proportion of customers spending between 8000 and 16000 is at least 0.75 (or 75%).
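
The credit card example can be reproduced with a few lines of Python. The function name `chebyshev_lower_bound` is illustrative, and rejecting k ≤ 1 is a design choice made here because the bound is zero or negative, and therefore uninformative, in that range.

```python
def chebyshev_lower_bound(k):
    """Lower bound on P(mu - k*sigma <= X <= mu + k*sigma), which is 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2

mean_spend, sd_spend = 12000, 2000
low, high = 8000, 16000

# The interval is symmetric around the mean, so k can be read off either side.
k = (high - mean_spend) / sd_spend
print(k)                         # 2.0
print(chebyshev_lower_bound(k))  # 0.75
```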
Example (Percentile Calculation)
Time between failures of wire-cut (in hours)

 2   22   32   39   46   56   76   79   88    93
 3   24   33   44   46   66   77   79   89    99
 5   24   34   45   47   67   77   86   89    99
 9   26   37   45   55   67   78   86   89    99
21   31   39   46   56   75   78   87   90   102

1. Calculate the mean, median, and mode of time between failures of wire-cuts
2. The company would like to know by what time 10% (ten percentile or P10) and
90% (ninety percentile or P90) of the wire-cuts will fail?
3. Calculate the values of P25 and P75.
Solution
1. Mean = 57.64 and median = 56. The value 46 occurs three times, as do 89 and 99, so the data has three modes: 46, 89 and 99.

2. Note that the data is arranged in increasing order in columns.

Position corresponding to Px ≈ x(n + 1)/100
The position of P10 = 10 × 51/100 = 5.1.
Rounding 5.1 to the nearest integer gives 5. The corresponding value from the table is 21 (10 percent of the observations in the table have a value less than or equal to 21). That is, by 21 hours, 10% of the wire-cuts will fail. In asset management (and reliability theory), this value is called the P10 life.

Instead of rounding the position, we can interpolate between neighbouring values. The value at the 5th position is 21, so the value at position 5.1 is approximated as 21 + 0.1 × (value at 6th position – value at 5th position) = 21 + 0.1 × (22 – 21) = 21.1.
The position of P90 = 90 × 51/100 = 45.9. The value at position 45 is 90 and the value at position 46 is 93, so the value at position 45.9 is 90 + 0.9 × 3 = 92.7. That is, 90% of the wire-cuts will fail by 92.7 hours.

3. The position of P25 (1st quartile or Q1) = 25 × 51/100 = 12.75. The value at the 12th position is 33, so
P25 = 33 + 0.75 × (value at 13th position – value at 12th position) = 33 + 0.75 × 1 = 33.75

The position of P75 (3rd quartile or Q3) = 75 × 51/100 = 38.25. The value at the 38th position is 86, so
P75 = 86 + 0.25 × (value at 39th position – value at 38th position) = 86 + 0.25 × 0 = 86
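
The whole solution can be checked with a short Python sketch. The `percentile` helper below is a hypothetical name; it implements the position rule x(n + 1)/100 with linear interpolation as used in this example. Clamping positions that fall outside the data to the first or last value is an added assumption.

```python
from statistics import mean, median, multimode

# Time between failures (hours), read column by column from the table.
failures = [2, 3, 5, 9, 21, 22, 24, 24, 26, 31,
            32, 33, 34, 37, 39, 39, 44, 45, 45, 46,
            46, 46, 47, 55, 56, 56, 66, 67, 67, 75,
            76, 77, 77, 78, 78, 79, 79, 86, 86, 87,
            88, 89, 89, 89, 90, 93, 99, 99, 99, 102]

def percentile(data, x):
    """P_x using position x(n + 1)/100 with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    position = x * (n + 1) / 100  # 1-based position, possibly fractional
    lower = int(position)         # integer part of the position
    fraction = position - lower
    if lower < 1:                 # position before the first value (assumption)
        return ordered[0]
    if lower >= n:                # position at or beyond the last value (assumption)
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

print(mean(failures), median(failures), multimode(failures))
# 57.64 56.0 [46, 89, 99]
for x in (10, 25, 75, 90):
    print(f"P{x} = {round(percentile(failures, x), 2)}")
# P10 = 21.1, P25 = 33.75, P75 = 86.0, P90 = 92.7
```

`statistics.multimode` is used rather than `statistics.mode` because it reports every value tied for the highest frequency.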
Percentile
• Percentile, decile and quartile are frequently used to identify the position of an observation in the dataset.
• Percentile, denoted as Px, is the value of the data below which x percent of the data lie.
• The position of Px in the sorted data is calculated as
Position corresponding to Px ≈ x(n + 1)/100
where n is the number of observations in the data.
Decile and Quartile

• Deciles are special values of percentiles that divide the data into 10 equal parts.
• The first decile contains the first 10% of the data, the second decile contains the first 20% of the data, and so on.

• Quartiles divide the data into 4 equal parts.

• The first quartile (Q1) contains the first 25% of the data; Q2 contains the first 50% of the data and is also the median. Quartile 3 (Q3) accounts for the first 75% of the data.
Bar Chart
• A bar chart is a frequency chart for a qualitative (or categorical) variable.

• A bar chart can be used to assess the most-occurring and least-occurring categories within a dataset.

• Histograms cannot be used when the variable is qualitative.

[Figure: Bar chart for movie genre]
Pie Chart

• A pie chart is mainly used for categorical data; it is a circular chart that displays the proportion of each category in the dataset.
[Figure: Pie chart for movie genre]
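
Both charts start from the same category counts: a bar chart plots the counts and a pie chart plots each category's share of the total. The movie genre data is not listed in these notes, so the minimal Python sketch below uses the Board SSC column of the structured-data table and draws the bars as text; the variable names are illustrative.

```python
from collections import Counter

# "Board SSC" column from the structured-data table, rows 1 to 10.
board_ssc = ["Others", "ICSE", "Others", "CBSE", "CBSE",
             "ICSE", "Others", "ICSE", "CBSE", "CBSE"]

counts = Counter(board_ssc)   # frequency of each category (bar heights)
total = sum(counts.values())

for category, count in counts.most_common():
    share = 100 * count / total  # proportion of the whole (pie slice)
    print(f"{category:<7} {'#' * count} {count} ({share:.0f}%)")
```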
Scatter Plot

• A scatter plot is a plot of two variables that helps data scientists understand whether there is any relationship between the two variables.

• The relationship could be linear or non-linear.

• A scatter plot is also useful for assessing the strength of the relationship and for finding outliers in the data.
[Figure: Scatter plot between movie budget and box office collection]
Box Plot (or Box and Whisker Plot)

• A box plot (also known as a box and whisker plot) is a graphical representation of numerical data that can be used to understand the variability of the data and the existence of outliers.
• A box plot is designed by identifying the following descriptive statistics:
• Lower quartile (1st quartile), median and upper quartile (3rd quartile).
• Lowest and highest values.
• Inter-quartile range (IQR).
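
The statistics listed above can be computed with Python's standard library, as in the minimal sketch below. It reuses the wire-cut failure data from the percentile example because the movie budget data is not listed in these notes. The `exclusive` method of `statistics.quantiles` (Python 3.8 or later) is chosen because it uses the same (n + 1)-based position rule as that example.

```python
from statistics import quantiles

# Time between failures (hours) from the percentile example.
failures = [2, 3, 5, 9, 21, 22, 24, 24, 26, 31,
            32, 33, 34, 37, 39, 39, 44, 45, 45, 46,
            46, 46, 47, 55, 56, 56, 66, 67, 67, 75,
            76, 77, 77, 78, 78, 79, 79, 86, 86, 87,
            88, 89, 89, 89, 90, 93, 99, 99, 99, 102]

# n=4 returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(failures, n=4, method="exclusive")

summary = {
    "min": min(failures),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(failures),
    "IQR": q3 - q1,
}
print(summary)
# {'min': 2, 'Q1': 33.75, 'median': 56.0, 'Q3': 86.0, 'max': 102, 'IQR': 52.25}
```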
IQR Box Plot

• The box plot is constructed using the IQR and the minimum and maximum values.
Bollywood movie Budget Boxplot

[Figure: Box plot for the Bollywood movie budget]
