Data Analysis in Business Research Key C
I. Research Scholar, Dept. of Management, Mizoram University, Aizawl
II. Assistant Professor, Dept. of Management, Mizoram University, Aizawl
Abstract
However adequate, valid and reliable the data may be, they serve no worthwhile purpose unless they are carefully analyzed. A number of techniques can be used while analyzing data. These techniques fall into two categories, descriptive and inferential, constituting descriptive and inferential analysis. They serve many purposes: to summarize the data in a simple manner, to organize them so that they are easier to understand, and to use the data to test theories about a larger population. Given the ready availability of computer software, tedious formulae and calculations can be avoided today. But there is no substitute for a good understanding of the conceptual basis of the analytic methodologies one applies in order to draw inferences from hard-won research data. Hence, an effort has been made in this paper to provide a theoretical introduction to a few of the most widely used analytical tools, which will allow one to produce meaningful data analysis in business research.
Keywords
Descriptive analysis, Inferential analysis, Hypothesis Testing, Estimation, Measures of Central Tendency, Measures of Dispersion, Measures of Relationship
C. Types of Analysis:
As mentioned earlier, in section 1, statistical analysis can be categorized into descriptive and inferential analysis.
1. Descriptive analysis: Descriptive analysis is mostly concerned with the computation of certain indices or measures from the raw data. Zikmund has observed, "…with descriptive analysis, the raw data is transformed into a form that will make them easy to understand & interpret." It is largely the study of the distribution of one variable. Such analysis can be carried out in three different ways:
Univariate analysis: When a single variable is analyzed alone, e.g., a statistic such as the "mean", which might refer to the age group of students, it is known as univariate analysis.
Bivariate analysis: When some association between two variables is measured simultaneously, it is known as bivariate analysis.
Multivariate analysis: When more than two variables are analyzed simultaneously, it is known as multivariate analysis.
a). Measures of Central Tendency:
i. Arithmetic Mean: The arithmetic mean, the most familiar measure of central tendency, is the sum of the observations divided by their number; it gives equal importance to every observation.
Geometric Mean: In many cases, giving equal importance to all observations may lead to misleading answers. One of the measures of location that can be used in such cases is the geometric mean. The Geometric Mean (G.M.) of a series of n observations is defined as the nth root of their product. It is to be noticed that if any observation is zero, calculating the G.M. is not possible, since the product of the values becomes zero.
Harmonic Mean: The Harmonic Mean (H.M.) is defined as the reciprocal of the arithmetic mean of the reciprocals of the observations. This mean is used in averaging rates when the time factor is variable and the act being performed is the same; for example, the H.M. is used for calculating the average speed of a car. Its main limitation is that it cannot be calculated if any value is zero.
ii. Median: There are situations in which a data set has extreme values at its lower or higher end, termed outliers in statistical language. In such cases the arithmetic mean is not desirable to use, since it is easily affected by those extreme values. For example, if the data are 2, 3, 5, 2, 22, the mean will be 6.8, which cannot be considered a good representative of the data. Hence, another measure of location, the median, is used in such cases. The median is also used whenever the exact values of some observations are not available.
The median is the point that divides a distribution of scores into two equal parts, one part comprising all values greater than the median and the other all values less than it. It is to be remembered that the median is a hypothetical point in the distribution; it may or may not be an actual score.
iii. Mode: The third central tendency statistic is the mode. The mode is defined as the 'most fashionable' value, i.e., the observation occurring most frequently in a set of data. For example, in the series 2, 3, 4, 2, 2, 6 and 9, the mode is 2, because three observations have this value. The mode is frequently used in cases where complete data are not available, as well as when the data are in qualitative form, where it is only possible to record the presence or absence of an observation.
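By way of illustration, the central tendency measures discussed above can be computed in a few lines of Python (chosen here purely for illustration; the paper prescribes no particular software), using only the standard library's statistics module. The observations are the hypothetical series from the mode example:

    import statistics

    data = [2, 3, 4, 2, 2, 6, 9]  # hypothetical observations

    print(statistics.mean(data))            # arithmetic mean: equal weight to every value
    print(statistics.median(data))          # median: midpoint of the ordered data
    print(statistics.mode(data))            # mode: most frequent value -> 2
    print(statistics.geometric_mean(data))  # G.M.: nth root of the product (Python 3.8+)
    print(statistics.harmonic_mean(data))   # H.M.: reciprocal of the mean of reciprocals

Note that geometric_mean is undefined (and raises an error) when an observation is zero, mirroring the limitation of the G.M. noted above.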
b). Measures of Variation or Dispersion: In addition to central tendency, every data set can be characterized by its variation and shape. Two or more data sets may have the same central tendency and yet show wide disparities in their formation. Variation measures the dispersion, or disparities, of the values in a data set. Dispersion may be defined as a statistical summary throwing light on the differences of items from one another or from an average. The measures most commonly used in statistics are the standard deviation and the variance, but there are many others, discussed below:
i. Range: The range is the simplest measure of variation in a set of data, and is defined as the difference between the maximum and minimum values of the observations. However, since it depends only on the minimum and maximum values and does not utilize the full information in the data, it is not considered very reliable.
ii. Semi Inter-Quartile Range or Quartile Deviation: Quartiles split a set of data into four equal parts: the first quartile, Q1, carries 25% of the data set values below it; the second quartile, Q2, 50%; and the third quartile, Q3, 75%. The interquartile range (also called the midspread) is the difference between the third and first quartiles in a data set, i.e., Q3 - Q1. The interquartile range measures the spread in the middle 50% of the data. However, a much more popular measure of variation is the Semi Inter-Quartile Range or Quartile Deviation, defined as (Q3 - Q1)/2.
iii. Mean or Average Deviation: The Mean Deviation is defined as the average of the differences of the individual items from some average of the series, which can be the mean, median or mode. Such a difference of an individual item from some average value is termed a deviation. While calculating the mean deviation, all deviations are treated as positive, ignoring their actual signs.
iv. Variance and Standard Deviation: In the mean deviation the negative signs are ignored because otherwise the total deviation comes out to be zero, since equal deviations with opposite signs cancel each other. Another way of getting over this problem of the total deviation being zero is to take the squares of the deviations of the observations from the mean. The sum of the squared deviations divided by the number of observations is known as the variance, and its square root is known as the standard deviation. Karl Pearson introduced these terms.
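A similar sketch for the dispersion measures, again relying only on Python's statistics module (the data are hypothetical; statistics.quantiles requires Python 3.8+, and its default quartile method may differ slightly from hand-computed textbook quartiles):

    import statistics

    data = [2, 3, 5, 2, 22, 7, 9, 4]  # hypothetical observations

    rng = max(data) - min(data)                   # range: maximum minus minimum
    q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartiles
    iqr = q3 - q1                                 # interquartile range (midspread)
    qd = iqr / 2                                  # quartile deviation: (Q3 - Q1) / 2

    m = statistics.mean(data)
    mean_dev = sum(abs(x - m) for x in data) / len(data)  # mean deviation: signs ignored

    var = statistics.pvariance(data)  # variance: mean of squared deviations from the mean
    sd = statistics.pstdev(data)      # standard deviation: square root of the variance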
c). Measure of Asymmetry (Skewness): A distribution of data values is either symmetrical or skewed. In a symmetrical distribution, the values below the mean are distributed exactly as the values above the mean; the low and high values balance each other out. In a skewed distribution, the values are not symmetrical around the mean, which results in an imbalance of low or high values. Skewness is a measure of this asymmetry in the data. Data can be negatively (or left) skewed or positively (or right) skewed. In a left-skewed distribution, most of the values lie in the upper portion of the distribution, whereas in a right-skewed distribution, most of the values lie in the lower portion. If the distribution is skewed, the extent of skewness can be measured by Bowley's Coefficient of Skewness or Pearson's Measure of Skewness.
Kurtosis is an indicator of the peakedness of a distribution; Karl Pearson called it the "convexity of the curve". A bell-shaped or normal curve is mesokurtic, a curve more peaked than the normal curve is leptokurtic, and a curve flatter than the normal curve is platykurtic.
Knowing the shape of a distribution is necessary, since assumptions about shape are made in the use of certain statistical methods.
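The shape measures can be sketched likewise. The fragment below uses Pearson's second measure of skewness, 3(mean - median)/standard deviation, and the moment-based kurtosis coefficient m4/m2^2, for which the normal (mesokurtic) value is 3; both are standard textbook formulas supplied here for illustration, and the data are hypothetical:

    import statistics

    data = [2, 3, 5, 2, 22, 7, 9, 4]  # hypothetical observations
    m = statistics.mean(data)
    sd = statistics.pstdev(data)

    # Pearson's measure of skewness: positive -> right skewed, negative -> left skewed
    skewness = 3 * (m - statistics.median(data)) / sd

    # kurtosis m4 / m2**2: > 3 leptokurtic, = 3 mesokurtic, < 3 platykurtic
    m2 = sum((x - m) ** 2 for x in data) / len(data)
    m4 = sum((x - m) ** 4 for x in data) / len(data)
    kurtosis = m4 / m2 ** 2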
d). Measures of Relationship:
Very often, researchers are interested in studying the relationship between two or more variables, which is done with the help of correlation and regression analysis. The ideas identified by the terms correlation and regression were developed by Sir Francis Galton in England.
Correlation is a statistical technique that describes the degree of relationship between two variables, in which, with a change in the value of one variable, the value of the other variable also changes. The degree of correlation between two variables is called simple correlation; the degree of correlation between one variable and several other variables is called multiple correlation.
The simplest, and yet probably the most useful, graphical technique for displaying the relationship between two variables is the scatter diagram (also called a scatter plot). Here, the data for the two variables are plotted on the x and y axes of a graph. If the points are scattered around a straight line, the correlation is linear; if the points are scattered around a curve, the correlation is non-linear (curvilinear).
The scatter plot gives a rough indication of the nature and strength of the relationship between two variables. The quantitative measurement of the degree or extent of correlation between two variables is performed by the coefficient of correlation. It was developed by Karl Pearson, the great biologist and statistician, and is hence referred to as the "Pearsonian Correlation Coefficient" (also known as the product moment correlation coefficient). It is denoted by the Greek letter ρ (rho) when calculated from population values, and by 'r' when calculated from sample values. The value of the coefficient of correlation varies between the two limits +1 and -1: the value +1 shows a perfect positive relationship between the variables, -1 shows a perfect negative correlation, and 0 indicates zero correlation. If the relationship between two variables is such that with an increase in the value of one, the value of the other increases or decreases in a fixed proportion, the correlation between the variables is said to be perfect. Perfect positive correlation means that an increase in one variable brings an increase in the other in the same proportion, and vice versa; perfect negative correlation means that an increase in one variable decreases the other variable in the same proportion. Zero correlation shows that there is no linear relationship between the two variables. It is to be noted that 'r' indicates the extent of only the linear relationship: a zero value indicates only that there is no linear relationship, while there could be some other, non-linear relationship.
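As a sketch, the coefficient of correlation can be computed from paired sample values as r = cov(x, y)/(sd(x)·sd(y)); the advertisement and sales figures below are hypothetical:

    import statistics

    x = [1, 2, 3, 4, 5]  # hypothetical advertisement expenditure
    y = [2, 4, 5, 4, 5]  # hypothetical sales

    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    r = cov / (statistics.pstdev(x) * statistics.pstdev(y))  # lies between -1 and +1

    # on Python 3.10+, statistics.correlation(x, y) returns the same value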
The Pearsonian correlation coefficient discussed above is applicable only when the data are in interval or ratio form, i.e., when quantitative measurement of variables such as height, weight, temperature and income is possible. For attributes such as beauty or honesty, and in similar cases where the data are available only in ordinal or rank form, Karl Pearson's formula for the correlation coefficient cannot be applied. Hence, Charles Edward Spearman in 1904 developed a measure called the 'Spearman Rank Correlation' to measure the correlation between the ranks of two variables. It is denoted as rs, and its value also ranges between +1 and -1. Spearman rank correlation is said to be a non-parametric or distribution-free method, since it does not require the assumption that both variables are normally distributed. A similar method for measuring the association between the ranks of variables is the Kendall Tau rank correlation.
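Spearman's rank correlation can be sketched by replacing the observations with their ranks and applying the rank-difference formula rs = 1 - 6·Σd²/(n(n² - 1)); the ranking helper below assumes no tied values, and the data are hypothetical:

    def ranks(values):
        # rank 1 for the smallest value; assumes no ties for simplicity
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]  # hypothetical scores
    y = [12, 20, 28, 27, 50, 29, 7, 17, 6, 11]           # hypothetical scores

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # squared rank differences
    rs = 1 - 6 * d2 / (n * (n * n - 1))             # also lies between -1 and +1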
When one or both of the variables are in categorical form, i.e., not measurable, but it is possible to know their frequency or total number of occurrences on the basis of presence or absence in each case, the data are said to be on a nominal scale. In such cases, to determine the association between two attributes, the 'coefficient of contingency' or 'coefficient of mean square contingency' introduced by Karl Pearson is used.
Correlation analysis deals with exploring the correlation between two or more variables, whereas regression analysis attempts to establish the nature of the relationship between variables, that is, to study the functional relationship between the variables and thereby provide a mechanism for prediction, or forecasting. For example, correlation tells us that there is a strong relation between advertisement and sales; regression will predict how much increase in sales a given increase in advertisement will bring.
Regression analysis can be of two types: simple (dealing with two variables) and multiple (dealing with more than two variables). If the relationship between two variables, one the independent (or predictor, or explanatory) variable and the other the dependent (or explained) variable, is a linear function, i.e., a straight line, then the linear function is called the simple regression equation and the straight line is known as the regression line. It is a "line of best fit", i.e., the line for which the difference between the actual and the estimated values is minimum. The simple regression equation is used to make predictions.
y = a + bx and x = a + by are the two possible regression equations when two variables are involved in the regression analysis. The first is called the regression equation of y on x, and the second the regression equation of x on y. In the first equation y is the dependent variable and x the independent variable, whereas in the second equation x is the dependent variable and y the independent variable. Here, 'a' and 'b' are constants: 'a' is the intercept and 'b' is the slope or inclination, most popularly known as the regression coefficient. The regression coefficient gives the change in the dependent variable when the independent variable changes by one unit. To estimate the relationship between x and y it is vital to determine 'a' and 'b'. This is done through the Method of Least Squares; the Principle of Least Squares also provides the criterion for selecting the "line of best fit" mentioned in the last paragraph.
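The least squares estimates have standard closed forms: b = Σ(xi - x̄)(yi - ȳ)/Σ(xi - x̄)² and a = ȳ - b·x̄. A minimal sketch fitting y = a + bx to hypothetical advertisement/sales figures and using the line for prediction:

    import statistics

    x = [10, 12, 15, 18, 20]  # hypothetical advertisement expenditure
    y = [40, 44, 50, 58, 61]  # hypothetical sales

    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx    # regression coefficient (slope)
    a = my - b * mx  # intercept

    y_hat = a + b * 16  # predicted sales when advertisement is 16 units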
In case a curved relationship is found between the variables, the correlation ratio eta (η) gives the degree of their association.
It may be noted that the methods discussed until now involve only two variables, i.e., simple correlation analysis and simple regression analysis. However, very often one is required to study the relation between more than two variables, that is, the impact of several independent variables, taken jointly, on a dependent variable. This is possible through multiple correlation and multiple regression analysis respectively. Here, multiple correlation coefficients are obtained which indicate the relation between one dependent variable and several independent variables, by using multiple regression equations.
When the correlation between any two variables is analyzed while the effect of a third variable on these two is held constant or removed, such analysis is known as partial correlation analysis, and such a correlation is termed the partial correlation coefficient. Similarly, the partial regression coefficient is the value indicating the change that will be caused in the dependent variable by a unit change in one independent variable when the other independent variables are held constant. As a matter of fact, multiple correlation coefficients and multiple regression coefficients are applicable only in the case of ratio or interval data. In the case of ordinal data, such correlation can be computed by the Kendall partial rank correlation, and in the case of nominal data, discriminant analysis is used. (See Table 1)

Table 1: Choice of relationship analysis tool based on number of variables and scale of measurement

For two variables (i.e., simple correlation):
- Interval or ratio data: Pearson product moment correlation coefficient
- Ordinal data: Spearman rank order correlation coefficient or Kendall Tau rank correlation
- Nominal data: Contingency coefficient

For more than two variables (i.e., multiple correlation):
- Interval or ratio data: Multiple regression analysis
- Ordinal data: Kendall partial rank correlation
- Nominal data: Discriminant analysis

Source: Compiled by Authors
2. Inferential Analysis: Inferential analysis is mainly concerned with (a) the estimation of population values such as the population mean and the population standard deviation, and (b) various tests of significance, i.e., the testing of hypotheses. Inferential analysis plays a major role in statistics, since it is mostly not possible to cover the whole population while conducting research; hence a sample is chosen, and using inferential analysis the sample values obtained are used to make inferences about the population. The objective of inferential analysis is to use the information contained in a small sample of observations for drawing a conclusion, or making an inference, about the larger population. Such an inference may take the form of estimation or of the testing of hypotheses or assumptions. For example, one could estimate a population parameter based on a sample statistic, like the 'mean life of a car battery', or one could test the claim of a company that the 'mean life of a car battery is 3 years'. In both cases an inference about the population is made. There are various methods of estimation and of testing hypotheses.
a). Estimation: Estimation deals with the estimation of parameters, such as the population mean, based on sample values. The method or rule of estimation is called an estimator, like the sample mean; the value which the method or rule gives in a particular case is the estimate of the population parameter. In other words, an estimator is a function of the sample values used to estimate a parameter of the population. With the help of a sample of observations, an estimate can be given in the form of a specific number, like 25 years, or in the form of an interval, like 23-27 years. In the former case it is referred to as a point estimate, whereas in the latter case it is termed an interval estimate.
i. Point estimate: A point estimate is used to estimate a population parameter with the help of a sample of observations. It is a single value, say 50, taken as the best value of the unknown population parameter. An estimator is said to be efficient if it has minimum variance, as the sample arithmetic mean does. There are several methods of estimating the parameters of a distribution, such as maximum likelihood, least squares, the method of moments, and minimum chi-square.
ii. Interval estimate: A point estimate gives a single value, taken as the best estimate of the parameter. However, if other data are collected from the same population, the point estimate may change. In a real-life situation the population parameter may not be exactly equal to the sample statistic, but could lie around this value. Thus it may be more logical to assume that the population value lies in an interval containing the sample value, such as 48-52, known as an interval estimate. It is expected that the true population value will fall within this interval with the desired level of confidence; hence the name 'confidence interval'. The interval should have reasonable limits, and these limits are statistically calculated; the limits or intervals so arrived at are referred to as confidence limits or confidence intervals. Since we are estimating a population parameter from sample values, we can never make any estimation with 100% confidence. The desired confidence for the estimation is termed the confidence level, and usually a 95% level of confidence is considered adequate. One can then state: 'with 95% confidence, it can be said that the population parameter will fall somewhere within the confidence interval of 40-50'.
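A sketch of an interval estimate for the population mean: for a reasonably large sample, the 95% confidence interval is commonly taken as x̄ ± 1.96·(s/√n), where 1.96 is the normal critical value. The battery-life sample below is hypothetical:

    import math
    import statistics

    lives = [2.9, 3.1, 3.4, 2.8, 3.0, 3.3, 2.7, 3.2, 3.1, 2.9] * 4  # hypothetical sample, n = 40

    n = len(lives)
    mean = statistics.mean(lives)  # point estimate of the population mean
    s = statistics.stdev(lives)    # sample standard deviation

    margin = 1.96 * s / math.sqrt(n)             # 1.96: normal value at 95% confidence
    lower, upper = mean - margin, mean + margin  # interval estimate (confidence limits)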
b). Testing of Hypothesis/Test of Significance: In most cases it is almost impossible to obtain direct knowledge of a population parameter; therefore hypothesis testing, or the test of significance, is the often-used strategy for deciding whether a sample offers such support for a hypothesis or assumption that a generalization about the population can be made. In other words, the test can find the probability that a sample statistic would differ from a parameter or from another sample.
Hypothesis testing typically begins with some assumption, hypothesis or claim about a particular parameter of a population. It could concern the parameters of a distribution describing the population, like the mean; the parameters of two or more populations; or correlations or associations between two or more characteristics of a population. Hypotheses are of two types: the null and the alternative hypothesis. The null hypothesis is considered to be a hypothesis of "no relationship", such as 'there is no significant difference between the sample means'. The term null hypothesis is said to have been introduced by R. A. Fisher. The word 'null' is used because the nature of testing is that we try our best to nullify, or reject, this hypothesis on the basis of the sample collected. When the null hypothesis is rejected, its opposite, the alternative hypothesis, is automatically accepted. The alternative hypothesis is thus the statement which is intended to be accepted if the null hypothesis is rejected. The null hypothesis is denoted as Ho and the alternative hypothesis as Ha. It has to be kept in mind that we cannot prove a hypothesis to be true; we may only find evidence that supports it. If we have failed to reject the null hypothesis, this does not mean that the null hypothesis has been proven true, because the decision is made only on the basis of sample information.
Once the null and alternative hypotheses have been set up, the next step is to decide on the level of significance, which is used as the criterion for rejecting the null hypothesis. It is expressed as a percentage, like 5% or 1%, or sometimes as 0.05 or 0.01. It is the level at which we are prepared to risk rejecting the null hypothesis even if it is true. A decision on the appropriate statistic, such as t, z or F, is then taken, and based on the level of significance the critical, or tabulated, value is found. After the statistic has been calculated from the given sample of observations, the test statistic is compared with the critical value. If the calculated value of the statistic is equal to or less than the critical value, the difference between the result and the expected value is insignificant; this insignificant difference can be attributed to sampling error, and hence the null hypothesis is accepted. Whereas if the calculated value is higher than the critical value, the difference is said to be significant and cannot be attributed to sampling error, and therefore the null hypothesis is rejected.
Whenever we take a decision about a population based on a sample, the decision cannot be 100% reliable. The possibilities are that we reject the null hypothesis even though it is true, termed a Type I error and denoted as α, or that we accept the null hypothesis even though it is false, termed a Type II error and denoted as β. The Type I error rate is also referred to as the level of significance, as discussed above. The quantity 1 - β is called the 'power' of the test, signifying the test's ability to reject the null hypothesis when it is false, and 1 - α is called the confidence coefficient.
Various tests of significance have been developed to meet various types of requirements. They may be broadly classified into parametric and non-parametric tests. Parametric tests are based on the assumption that the observations are drawn from a normal distribution; since the testing procedure requires assumptions about the type of population or the values of its parameters, these tests are known as 'parametric tests'. The tests of significance developed for situations where this condition is not satisfied are known as 'non-parametric tests' or 'distribution-free tests'. As a matter of fact, parametric tests are more powerful than non-parametric tests.
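The decision procedure described above can be sketched for a one-sample test of the earlier claim that the 'mean life of a car battery is 3 years'. Since the population S.D. is unknown, a t statistic is used; the sample values are hypothetical, and 2.262 is the standard tabulated two-tailed t value at the 5% level for 9 degrees of freedom:

    import math
    import statistics

    sample = [2.9, 3.1, 3.4, 2.8, 3.0, 3.3, 2.7, 3.2, 3.1, 2.9]  # hypothetical battery lives (years)
    mu0 = 3.0  # Ho: mean life = 3 years; Ha: mean life differs from 3 years

    n = len(sample)
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))

    critical = 2.262  # tabulated value at the 5% significance level, n - 1 = 9 d.f.
    if abs(t) <= critical:
        print("Difference insignificant: fail to reject Ho")
    else:
        print("Difference significant: reject Ho")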
The various parametric and non-parametric tests of significance, performing different functions under different conditions, are presented below in tabular form:

Table 2: Choice of parametric/non-parametric test based on function to perform & scale of measurement

Function: Test of significance of one sample
- Parametric tests (interval/ratio data): 't' test (mean known, S.D. unknown); 'z' test (mean known, S.D. known)
- Non-parametric tests (ordinal/nominal data): Sign test