RM Notes 2020
METHODOLOGY
NOTES
1. AN OVERVIEW OF RESEARCH METHODOLOGY
• Objectivity of Investigator
– Unbiased
– Procedural integrity
– Accurate reporting
• Accuracy of Measurement
– Valid and Reliable
– Meaningful and useful
– Appropriate design (sample, execution)
• Open-minded to Findings
– Willing to refute expectations
– Acknowledge limitations
Objectives of Research
To achieve new insight into a phenomenon (exploratory / formulative research).
Why Research?
• Taking the challenge to solve an unsolved problem.
• Desire to get intellectual satisfaction of doing some creative work.
• Desire to get a research degree.
• Desire to move up the career ladder in the academic institutions.
• Desire to be of service to the society.
Significance of Research
1. Research inculcates scientific and inductive thinking and it promotes the development
of logical habits of thinking.
2. Research provides the basis for nearly all government policies in our economic system.
3. It helps to solve various operational and planning problems of business and industry
(market research, operations research, demand forecasting).
[Figure: The research process as a cycle of eight steps: (1) Identifying the research
problem/opportunity; (2) Determine the research design; (3) Determine data collection
method; (4) Design data collection forms; (5) Design sample and collect data;
(6) Analyzing the data; (7) Preparing and presenting the report; (8) Follow-up.]
Validity - Have you measured (or observed) what you think you have? Were the instruments
used suitable for purpose? Have you adequately and faithfully captured the ‘state of affairs’?
Reliability
Even if the methods are valid, can we be sure that the data are consistent and a true
reflection of the phenomena under study?
Replicability
This is essential in scientific work. It means that the work has been done and described in
such a way that it is repeatable. In social science exact replication is often impossible, but
similar studies add to the weight of evidence.
Generalisability
Are the findings generally applicable, for example to other contexts, situations, times, or
persons other than the sample?
Establishing Credibility
Credibility is a property of good research.
It calls for care and attention in writing up the work in such a way that readers have
confidence in its integrity.
Deductive Approach:
Deductive reasoning works from the more general to the more specific. It is a "top-down"
approach. We might begin with thinking up a theory about our topic of interest. We then
narrow that down into more specific hypotheses that we can test. We narrow down even
further when we collect observations to address the hypotheses. This ultimately leads us to
be able to test the hypotheses with specific data -- a confirmation (or not) of our original
theories.
Inductive Approach:
Inductive reasoning works the other way, moving from specific observations to broader
generalizations and theories. This is a "bottom up" approach. In inductive reasoning, we
begin with:
Types of Research
1. Descriptive (Ex-post facto research) Vs. Analytical (Critical Evaluation of the
material).
2. Applied (Action) Vs. Fundamental (Basic or Pure).
3. Quantitative (Inferential/experimental/simulation) Vs. Qualitative.
4. Conceptual (abstract idea or theory) Vs. Empirical (Experience or observations based
on data)
5. Longitudinal Research (Over a time period such as clinical or diagnostic research) Vs.
Laboratory or Simulation Research.
Qualitative Research
By contrast, a study based upon a qualitative process of inquiry has the goal of
understanding a social or human problem from multiple perspectives. Qualitative research
is conducted in a natural setting and involves a process of building a complex and holistic
picture of the phenomenon of interest. In qualitative research, a hypothesis is not needed
to begin research.
• Quantitative Research
• Qualitative Research
In qualitative research, however, it is thought that the researcher can learn the most by
participating and/or being immersed in a research situation.
The research is confirmatory rather than exploratory i.e. this is a frequently
researched topic, and (numerical) data from earlier research is available.
There is no ambiguity about the concepts being measured, and only one
way to measure each concept.
You are studying the trends in weather in the town where you live. There aren't many
variables: temperature ranges, wind speed, rainfall, barometric pressure, and perhaps
a few others. Most of the variables are measured mechanically, and a lot of historical
data exists. You wouldn't even consider doing qualitative research on this.
Research Methods:
Methods of data collection
Statistical methods used for establishing relationships between the data and the
unknowns.
Methods used to evaluate the accuracy of results obtained.
Research Methodology:
Research Methods
Consideration of the logic behind the methods we use.
Research Process
Series of actions or steps necessary to effectively carry out research and the desired
sequencing of these steps.
• Rephrasing the same into meaningful terms from an analytical point of view.
• Studies of broader literature (Conceptual and empirical)
This stage is important because
1. The research problem needs to be defined unambiguously.
2. It helps in collecting the relevant data, choosing the research methods, etc.
Primary Data:
By observation, through personal interviews, telephone interviews and by mailing
of questionnaires.
Secondary Data
G. Analysis of Data:
1. Computation of statistics viz., mean, median, mode, standard Deviation,
coefficient of variation, coefficient of skewness etc.
4. Factor, Discriminant, Conjoint analysis
H. Hypotheses Testing:
I. Interpretation of Results:
K. Report Writing
2. RESEARCH DESIGN
RESEARCH DESIGN
It is the conceptual structure within which research is conducted. It constitutes the
blueprint for the collection, measurement and analysis of data.
• Causal Research (experiments and other approaches): allows isolation of causes
and effects via use of experiment or surveys.
1. Extraneous variables:
(Independent variables that are not related to the purpose of the study but may affect
the dependent variable are termed extraneous variables).
3. Control:
(The technical term 'control' is used when we design the study so as to minimize the effects of
extraneous independent variables).
4. Confounded relationships
(When the dependent variable is not free from the influence of extraneous variables).
5. Research Hypotheses:
The research hypothesis is a predictive statement that relates an independent
variable to a dependent variable.
Predictive statements which are assumed but not to be tested are not termed as
research hypotheses.
Experimental and non-experimental hypothesis testing research:
For example mother’s age 15-45 overall group above 15 years.
7. Experimental and control groups: For example dummy variable of caste / religion.
3. METHODS OF DATA COLLECTION
• Secondary data: information that has previously been gathered by someone other
than the researcher and/or for some other purpose than the research project at hand
• Possible coding problems
• Data may be available but it may have problems:
Missing or incomplete data.
Unknown definitions of data.
Changed definitions or procedures.
Might be too aggregated.
Types of Observations
Structured (Descriptive)
Unstructured (Exploratory)
Participant (Anthropological)
Non-Participant (Political forecasts)
Disguised Participation (Presence of the observer is hidden)
In-Depth Techniques:
Focus groups Interviews
Interviews: Personal, Telephonic, Focused, Non-Directive
Projective Techniques
• It provides greater information in-depth.
• Can overcome resistance through persuasion.
• Personal information can be sought.
• Low non-response.
• Can secure most spontaneous reactions.
• Adaptability of the language to the level of interviewee.
• Can collect supplementary information which may be of great value in
interpreting the results.
• Interviewer can clarify unclear questions
• Literacy is not required
• Interviewer can collect more complex answers and observations
• Interviewer can minimize missing and inappropriate responses
• Interviewer can prevent respondent from answering out of sequence
Pre-requisites of interviewing
Demerits:
1. No thinking space for the interviewee.
2. Survey is restricted to those who have telephone facilities.
3. Unsuitable for intensive surveys where comprehensive answers are required.
4. Greater possibility of bias.
Merits:
Low cost
Free from the bias of the interviewer.
Enough thinking space.
Can reach otherwise inaccessible people.
Sample could be larger.
Demerits:
Low rate of return.
Only the educated and cooperating people could be approached.
Difficulty in modifying the approach once the questionnaire is made.
Possibility of ambiguous replies/omissions of questions.
Method is likely to be the slowest of all.
Open-ended questions generate answers that are more nuanced and information-rich.
They give the subject freedom to answer the question in his or her own words (without
pre-specified alternatives).
Open-ended questions do not provide respondents with any answers from which to choose.
– Advantages:
• Not forced to choose between categories
• May better reflect respondents' thoughts/beliefs
• Appropriate when list of possible answers is excessive
• Lets respondent have the say, let him tell the researcher what he means, and not
vice-versa (obtain unanticipated answers)
– Disadvantages:
• Respondent may say too much or too little
• Provide incomplete or unintelligible answers
• Flexibility in responses makes them difficult to code and analyze; interpretations of
answers may vary
• Too much variance in response
• Expensive and time-consuming
Closed-ended Questions
Closed-ended questions provide respondents with a list of responses from which to choose.
Alternatively, closed-ended questions can provide multiple choices for the respondent to
accept or reject.
- Advantages:
• Easy to answer and takes little time
• Answers can be precoded (assigned a number) and easily transferred to a
computer
• Answers are easy to compare
• Easier to elicit responses to sensitive questions
• Answers are more reliable
• Meaning of responses is clearer to the researcher
– Disadvantages:
• May not be accurate--forces people to accept categories, or puts too many people
into “other” category
• Answers relative to response scale provided
• Respondent's choice not among listed alternatives
• Choices listed communicate kind of response wanted
• Wording of response choices may influence responses
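The precoding mentioned among the advantages above can be sketched roughly as follows. This is only an illustration: the code book, the `OTHER` code, the helper names `encode` and `tabulate`, and the sample answers are all hypothetical and are not taken from these notes.

```python
from collections import Counter

# Hypothetical code book for one closed-ended question:
# each listed alternative is assigned a number in advance (precoding).
CODE_BOOK = {"Yes": 1, "No": 2, "Can't say": 3}
OTHER = 9  # assumed code for answers that are not among the listed alternatives


def encode(answer):
    """Return the numeric code for an answer, or OTHER if it is not listed."""
    return CODE_BOOK.get(answer.strip(), OTHER)


def tabulate(answers):
    """Count how many respondents gave each coded answer."""
    return Counter(encode(a) for a in answers)


if __name__ == "__main__":
    # Made-up responses, for illustration only.
    responses = ["Yes", "No", "Yes", "Can't say", "Maybe"]
    for code, count in sorted(tabulate(responses).items()):
        print(code, count)
```

Once answers are stored as numbers like this, comparing and counting them is straightforward, which is the point made in the list; an answer outside the listed alternatives ends up in the catch-all code, which mirrors the disadvantage noted above.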
1) Questionnaires are sent through mail to the informants while schedules are filled in either
by the researcher himself or by the enumerators who are specially appointed for the
purpose.
2) A questionnaire is relatively cheap, but data collection through schedules is expensive.
3) Non-response is high in case of a questionnaire.
4) In case of a questionnaire, identity of the person who has actually filled in may be
unknown as he/she might be doing it on behalf of someone else.
5) Questionnaire method is slow as many respondents may not return the filled in response
in time.
6) Personal contact is not possible in case of questionnaires.
7) Questionnaire method can be used only when respondents are literate and cooperative.
8) Coverage with questionnaire could be wider and cheap.
9) Risk of collecting incomplete and wrong information is relatively more under the
questionnaire method particularly when people are unable to understand questions properly.
10) Observation method can also be used along with the schedules but it is not possible with
the questionnaire.
4. BASIC STATISTICAL MEASURES
The data of a given situation must be characterized by some statistical measure for the purpose
of estimation, comparison with similar data, or making inferences about the
population to which the data belong.
4. Kurtosis: Even if we know the measures of central tendency, variation and skewness, we still
cannot form a complete idea about the distribution. We should also know the convexity of the
frequency curve, or kurtosis. Kurtosis gives us an idea about the flatness or
peakedness of the frequency curve.
It is measured by the coefficient β2 = μ4/μ2², where μ2 and μ4 are the second and fourth
central moments, or by its derivative γ2 = β2 − 3.
The normal curve, for which β2 = 3, is called the mesokurtic curve.
5. Time Series: Researchers often have to deal with quantities which change in
value with time. To learn about the nature of the variation of a quantity
over time, time series can be used.
Graphs can be used to plot such values; such a graph is called the historigram of the time series.
These causes are segregated and this process is called analysis of time series.
The variations in the value of the variable can be analysed into the following three
main components.
1. The basic or long-term trend
2. Short-term or periodic changes
3. Irregular fluctuations
Measurement of Trend:
a. Free hand smoothing
b. Sectional Average: In this method the whole series is divided into a suitable number of
sections and the average of each section is found. These averages are plotted against the
mid-year of the sections, then a freehand smooth curve is drawn through these points. The
curve represents the trend.
c. Method of Moving Average: In this method fluctuations due to cyclical changes are
eliminated by averaging the values of the variable for a specified number of
successive years. The number of years over which the values are averaged depends upon
the average length of the cycle found in the series. The mean is taken for each successive
group of years. All these means are plotted and successive points are joined by straight
line segments. The resulting polygonal graph indicates the trend of the given time series.
d. Method of Least Squares: Time deviations x are taken from the mid-year of the time series,
so that the deviations sum to zero. The deviations are squared and are also multiplied by
the corresponding values y. The trend line y = a + bx then has a = Sum(y)/n and
b = Sum(xy)/Sum(x²), and the values a + bx are the trend ordinates. When these are plotted
against the corresponding years we get the line of best fit in the sense of least squares.
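The sectional average and moving average methods described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function names are hypothetical, the sections are assumed to be consecutive blocks of a fixed length, and the moving average is assumed to use an odd number of years so that each mean falls on a middle year.

```python
def sectional_averages(years, values, section_length):
    """Average each consecutive section and pair it with the section's mid-year."""
    if len(years) != len(values) or section_length < 1:
        raise ValueError("years and values must match; section_length must be >= 1")
    points = []
    for start in range(0, len(values), section_length):
        section_years = years[start:start + section_length]
        section_values = values[start:start + section_length]
        mid_year = (section_years[0] + section_years[-1]) / 2
        points.append((mid_year, sum(section_values) / len(section_values)))
    return points


def moving_averages(years, values, window):
    """Mean of each run of `window` successive years, placed at the middle year."""
    if len(years) != len(values):
        raise ValueError("years and values must have the same length")
    if window < 1 or window % 2 == 0 or window > len(values):
        raise ValueError("this sketch assumes an odd window no longer than the series")
    half = window // 2
    points = []
    for mid in range(half, len(values) - half):
        run = values[mid - half:mid + half + 1]
        points.append((years[mid], sum(run) / window))
    return points


if __name__ == "__main__":
    # Made-up yearly figures, for illustration only.
    yrs = [2001, 2002, 2003, 2004, 2005, 2006]
    vals = [10, 12, 14, 20, 22, 24]
    print(sectional_averages(yrs, vals, 3))
    print(moving_averages(yrs, vals, 3))
```

Plotting the returned points and joining them (by a freehand curve for sectional averages, by straight segments for moving averages) gives the trend described in the text.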
Y = a + bX + u is the linear form of the regression equation, where u is a random error.
To fit the straight line we apply the method of least squares: in order to
estimate a and b we minimize the sum of squares of the errors ui. For this we solve the
normal equations of the regression.
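As a rough sketch of the least squares fit described above: minimizing the sum of squared errors for Y = a + bX leads to two normal equations, Sum(Y) = n.a + b.Sum(X) and Sum(XY) = a.Sum(X) + b.Sum(X²), which are solved for a and b. The function names below are hypothetical and the data are made up.

```python
def fit_line(xs, ys):
    """Estimate a and b in Y = a + bX by solving the two normal equations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (X, Y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_xx - sum_x * sum_x
    if denominator == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = (sum_y - b * sum_x) / n
    return a, b


def sum_of_squared_errors(xs, ys, a, b):
    """Sum of squares of the errors u_i = Y_i - (a + b X_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Made-up data lying exactly on Y = 1 + 2X, for illustration only.
    xs = [1, 2, 3, 4]
    ys = [3, 5, 7, 9]
    a, b = fit_line(xs, ys)
    print(a, b, sum_of_squared_errors(xs, ys, a, b))
```

When X is measured as deviations from the mid-year, Sum(X) is zero and the same equations reduce to a = Sum(Y)/n and b = Sum(XY)/Sum(X²).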
5. SAMPLING METHODS
Advantages of Sampling:
(1) Less time taken to collect data
(2) Less cost for data collection
(3) Physical impossibility of complete enumeration necessitates sampling
(4) More accuracy of data collected due to its limited size.
Sampling Frame: The complete list of all the members/units of the population from which
each sampling unit is selected is known as sampling frame. It should be free from error.
Sampling Methods:
Probability Sampling: In probability sampling each unit of the population has a known, non-zero
probability of being selected as a unit of the sample. This probability varies from one
method of probability sampling to another.
In non-probability sampling certain units of the population may have
zero probability of selection, because the judgment, biases and convenience of the interviewer
are the criteria for the selection of sample units.
Units are selected from the population based on their respective probabilities using
Monte Carlo simulation.
(2) Systematic Sampling: This is a special kind of random sampling in which the
selection of the first unit of the sample from the population is based on randomisation. The
remaining units of the sample are selected from the population at a fixed interval I, where
N is the population size and n is the sample size.
Sampling interval width I = N/n
(3) Stratified Sampling: It is an improvement over simple random sampling and
systematic sampling.
In this sampling the population is divided into a specified set of strata such that members
within a stratum have similar attributes but members of different strata have dissimilar
attributes.
(a) Proportional Stratified Sampling: The same proportion of units is selected
from each stratum. It is used when there is not much difference (less variance) in
attributes within each stratum.
N = population size, Ni = size of stratum i, ni = size of the sub-sample from stratum i
n is the total sample size, such that n = n1 + n2 + ... + nk
n1/N1 = n2/N2 = ... = nk/Nk = n/N
n1 = n.N1/N, ..., nk = n.Nk/N
(b) Disproportional Stratified Sampling: A different proportion of units is selected from
each stratum. It is used when attributes differ and there is high variance. In this sampling
a stratum with more variance gets proportionately more sampling units than
a stratum with less variance.
Number of sampling units of stratum i: ni = n.qi.si / Sum(qj.sj), summed over all strata j,
where si is the standard deviation of stratum i
and qi = Ni/N
(4) Cluster Sampling: The population is divided into different clusters. Members within a cluster
are dissimilar in terms of their attributes, but different clusters are similar to each
other. Each cluster can be treated as a small population which possesses all the
attributes of the population. Any one cluster is selected and all units of that cluster
constitute the sample.
(5) Multistage Sampling: In a large scale survey covering the entire nation the size of
the sample frame will be very large. In such study multistage sampling technique is
used.
Stage 1: Different states of the country are sampled from each region using stratified
sampling. Here it is assumed that the states within the region are similar and the regions
are dissimilar.
Stage 2: Then cluster sampling can be used within each selected state by treating
the different districts of each state as its clusters.
Stage 3: In each selected district random sampling may be used to select the
proportional number of units from it.
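The systematic sampling rule and the two stratified allocation formulas given above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the sampling interval N/n is rounded down to a whole number, and the allocations are returned unrounded (in practice they would be rounded to whole units).

```python
import random


def systematic_sample(population, n, rng=None):
    """Pick a random start in the first interval, then every I-th unit (I = N // n)."""
    rng = rng or random.Random()
    size = len(population)
    if not 0 < n <= size:
        raise ValueError("sample size must be between 1 and the population size")
    interval = size // n              # sampling interval I = N/n, rounded down
    start = rng.randrange(interval)   # random selection of the first unit
    return [population[start + k * interval] for k in range(n)]


def proportional_allocation(strata_sizes, n):
    """ni = n * Ni / N for each stratum."""
    total = sum(strata_sizes)
    return [n * size / total for size in strata_sizes]


def disproportional_allocation(strata_sizes, strata_sds, n):
    """ni = n * qi * si / Sum(qj * sj), with qi = Ni / N."""
    total = sum(strata_sizes)
    weights = [(size / total) * sd for size, sd in zip(strata_sizes, strata_sds)]
    weight_sum = sum(weights)
    return [n * w / weight_sum for w in weights]


if __name__ == "__main__":
    # Made-up population and strata, for illustration only.
    print(systematic_sample(list(range(1, 101)), 10, random.Random(42)))
    print(proportional_allocation([500, 300, 200], 100))
    print(disproportional_allocation([500, 500], [10, 30], 100))
```

In the last call both strata have the same size, so the stratum with three times the standard deviation receives three times as many sampling units.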
1. Convenience Sampling
2. Judgment Sampling also called Purposive Sampling
3. Quota sampling
4. Snowball Sampling
6. HOW TO CONDUCT SURVEYS
Introduction
This lecture will help you to learn how to conduct a survey and design a
questionnaire. Survey research is being conducted in almost all areas of management,
economics and social sciences. It is quite relevant to understand the various techniques and
tools used in survey research. As a faculty of management and social sciences, you conduct
market research, socio-economic evaluation study, opinion-based studies, and policy and
programme assessment studies. For all such type of studies, conducting survey through
questionnaire becomes essential. Keeping this in view, this lecture focuses on survey
research techniques and how to design a questionnaire that gets the true opinions of your
sample. Questionnaires are the most common marketing research method. They are used for
structured interviews, written surveys, email, and internet surveys.
Conducting a survey is a useful way of finding something out, especially when
'human factors' are under investigation. Although surveys often investigate subjective
issues, a well-designed survey should produce quantitative, rather than qualitative, results.
That is, the results should be expressed numerically, and be capable of rigorous analysis.
Researchers quite often underestimate how difficult it is to carry out a survey well; a good
survey is more than a handful of questionnaires and a couple of bar charts: it requires careful
planning, methodical application, and detailed analysis of the results.
1. A literature search involves reviewing all readily available materials. These materials
can include internal company information, relevant trade publications, newspapers,
magazines, annual reports, company literature, on-line databases, and any other published
materials.
2. Talking with people is a good way to get information during the initial stages of a
research project. It can be used to gather information that is not publicly available, or that is
too new to be found in the literature. Examples might include meetings with prospects,
customers, suppliers, and other types of business conversations at trade shows, seminars, and
association meetings. Although often valuable, the information has questionable validity
because it is highly subjective and might not be representative of the population.
3. A Focus Group is used as a preliminary research technique to explore people’s ideas and
attitudes. It is often used to test new approaches (such as products or advertising), and to
discover customer concerns. A group of 6 to 20 people meet in a conference-room-like
setting with a trained moderator. The room usually contains a one-way mirror for viewing,
including audio and video capabilities. The moderator leads the group's discussion and keeps
the focus on the areas you want to explore. The disadvantage of focus groups is that the
sample is small and may not be representative of the population in general.
4. Personal Interviews are a way to get in-depth and comprehensive information. They
involve one person interviewing another person for personal or detailed information.
Personal interviews are very expensive because of the one-to-one nature of the interview.
Typically, an interviewer will ask questions from a written questionnaire and record the
answers verbatim. Personal interviews (because of their expense) are generally used only
when subjects are not likely to respond to other survey methods.
5. Telephone Surveys are the fastest method of gathering information from a relatively large
sample (100-400 respondents). The interviewer follows a prepared script that is essentially
the same as a written questionnaire. However, unlike a mail survey, the telephone survey
allows the opportunity for some opinion probing. Telephone surveys generally last less than
ten minutes.
6. Mail Surveys are a cost-effective method of gathering information. They are ideal for
large sample sizes, or when the sample comes from a wide geographic area. They cost a little
less than telephone interviews; however, they take over twice as long to complete (eight to
twelve weeks). Because there is no interviewer, there is no possibility of interviewer bias.
The main disadvantage is the inability to probe respondents for more detailed information.
7. Email and Internet surveys are relatively new and little is known about the effect of
sampling bias in Internet surveys. While it is clearly the most cost effective and fastest
method of distributing a survey, the demographic profile of the Internet user does not
represent the general population, although this is changing. Before doing an e-mail or
Internet survey, carefully consider the effect that this bias might have on the results.
What is Survey?
A survey is a method of collecting information directly from people about their ideas,
feelings, health, plans, beliefs, and social, educational and financial background. It usually
takes the form of self-administered questionnaires and interviews. Self-administered
questionnaires can be completed by hand or by computer. Interviews take place in person or
on telephone.
Surveys are conducted to meet policy or programme needs. For instance, a company
is considering providing day care for children of its working staff. How many have young
children? How many would use the agency services?
conduct research when the information you need should come directly from people. The
data they provide are descriptions of attitudes, values, habits and background characteristics
such as age, health, education and income.
Types of Survey
1. Cross-sectional Survey
With this design, data are collected at a single point in time. Think of a cross-sectional
survey as a snapshot of a group of people or organizations. Cross-sectional
surveys have several advantages.
2. Longitudinal Surveys
With a longitudinal survey, data are collected over time. At least three variations are
particularly useful.
(a) Trend: A trend design means surveying a particular group over time. For example,
studying a group of rural people’s socio-economic conditions over time.
(b) Cohort: In a cohort survey, you study a particular group over time but the people in the
group may vary.
(c) Panel: A panel survey means collecting data from the same sample over time.
Do you think the left parties and the congress will soon reach a greater degree of
understanding? (Biased question)
When you have questions that you suspect encourage strong views on either side.
Better: in your opinion, in the next two years, how is the relationship between the left
parties and congress likely to change?
Much improvement
Some improvement
Some worsening
Much worsening
Impossible to predict
6. Use caution when asking for personal information: another source of bias may result
from questions that may intimidate the respondent.
How much do you earn each year? Are you single or divorced? How do you feel about
your teacher, counselor or doctor? When personal information is essential to the survey,
you can ask questions in the least emotionally charged way if you provide categories of
responses.
Example:
7. Each question should have just one thought: do not use questions in which a
respondent’s truthful answer could be both yes and no at the same time.
Survey Design
Here, we discuss options and provide suggestions on how to design and conduct
successful survey research. There are 7 steps in survey research:
Setting Objectives
The first step in any survey is deciding the objectives. If your objectives are unclear, the
results will probably be unclear. Some typical objectives may be like these:
Opinions about political candidates or issues
Corporate images
There are two main components in determining whom you will interview. The first
is deciding what kind of people to interview. Researchers often call this group the target
population. If you conduct an employee attitude survey or an association membership
survey, the population is obvious. If you are trying to determine the likely success of a
product, the target population may be less obvious. Correctly determining the target
population is critical. If you do not interview the right kinds of people, you will not
successfully meet your goals.
The next thing to decide is how many people you need to interview. You must make a
decision about your sample size based on factors such as: time available, budget and
necessary degree of precision.
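To illustrate how the necessary degree of precision drives the sample size, the sketch below uses a standard textbook formula for estimating a proportion, n = z².p(1 − p)/e², with an optional finite population correction. The formula, the default values (95% confidence, z = 1.96, p = 0.5) and the function name are assumptions added for illustration; the notes themselves do not prescribe a formula, and time and budget still have to be weighed separately.

```python
import math


def sample_size(margin_of_error, proportion=0.5, z=1.96, population=None):
    """Respondents needed to estimate a proportion within +/- margin_of_error.

    Uses n0 = z^2 * p * (1 - p) / e^2 and, if a population size is given,
    the finite population correction n = n0 / (1 + (n0 - 1) / N).
    """
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)  # round up to a whole respondent


if __name__ == "__main__":
    print(sample_size(0.05))                   # large population
    print(sample_size(0.05, population=1000))  # small target population
```

Halving the margin of error roughly quadruples the required sample, which is why precision has to be balanced against the time and budget available.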
Bias
A survey is biased if its outcome has been influenced by factors other than the one being
studied. Bias is occasionally overt: the experimenter is not open-minded about the results,
and interprets them wrongly. But more often bias comes from poor survey design. A typical
problem is that of comparing two groups of people that are not really alike. For example, if
there are more men than women in one group, and more women than men in another, the
responses of the groups to any question will be influenced by the differences between men
and women. The solution to this problem is that of randomization. In some cases it is
necessary to use 'stratified' random sampling to ensure that the sample is typical of the
population.
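The randomization remedy described above can be sketched as follows: people are first grouped into strata (for example men and women) and each stratum is then split at random, so that the two groups being compared end up with a similar composition. The function name, the `stratum_of` callback and the sample data are hypothetical, and splitting each stratum in half is just one simple way to do it.

```python
import random
from collections import defaultdict


def stratified_random_split(people, stratum_of, rng=None):
    """Randomly split people into two groups with each stratum shared evenly."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for person in people:
        strata[stratum_of(person)].append(person)
    group_a, group_b = [], []
    for members in strata.values():
        rng.shuffle(members)              # random order within the stratum
        half = len(members) // 2
        group_a.extend(members[:half])
        group_b.extend(members[half:])
    return group_a, group_b


if __name__ == "__main__":
    # Made-up respondents as (sex, id) pairs, for illustration only.
    respondents = [("M", i) for i in range(6)] + [("F", i) for i in range(4)]
    a, b = stratified_random_split(respondents, lambda p: p[0], random.Random(7))
    print(a)
    print(b)
```

With the made-up data above, each group receives three men and two women, so differences between men and women no longer distort the comparison between the groups.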
Selecting Respondents
Select survey respondents at random from the intended audience. If at all possible, identify
a comparison group that doesn't get the information so that you can see how much of the
change in knowledge, attitude, and/or behavior is a result of your information versus a result
of other factors in the market place. This is a variation on a control group; in a real
experiment, you would randomly assign people to either a group that gets the information or
the control group that would not. But random assignment is not feasible in the context of
report cards, so a comparison group is an acceptable alternative. One of the easiest ways to
create a comparison group is to collect baseline data, i.e., responses to key questions
collected before the information was disseminated. This is often referred to as a "pre/post"
survey. You do not have to contact the same people before and after the distribution period.
But be sure to survey a representative sample each time so that their responses are
comparable.
Interviewing Methods
Once you have decided on your sample you must decide on your method of data collection.
Each method has advantages and disadvantages. Some of the methods are listed as follows:
1. Personal Interview: An interview is called personal when the Interviewer asks the
questions face-to-face with the Interviewee. Personal interviews can take place in the home,
at a shopping mall, on the street, outside a movie theater or polling place, and so on.
2. Telephone Surveys: Surveying by telephone is the most popular interviewing method.
This is made possible by nearly universal coverage.
3. Mail Surveys: One way of improving response rates to mail surveys is to mail a postcard
telling your sample to watch for a questionnaire in the next week or two. Another is to follow
up a questionnaire mailing after a couple of weeks with a card asking people to return the
questionnaire.
4. Computer Direct Interviews: These are interviews in which the Interviewees enter their
own answers directly into a computer. They can be used at malls, trade shows, offices, and
so on. Some researchers set up a Web page survey for this purpose.
5. Email Surveys: Email surveys are both very economical and very fast. More people have
email than have full Internet access. This makes email a better choice than a Web page
survey for some populations. On the other hand, email surveys are limited to simple
questionnaires, whereas Web page surveys can include complex logic.
6. Internet/Intranet (Web Page) Surveys: Web surveys are rapidly gaining popularity. They
have major speed, cost, and flexibility advantages, but also significant sampling limitations.
These limitations make software selection especially important and restrict the groups you
can study using this technique. Internet survey is recommended mainly when your target
population is Internet users. Business-to-business research and employee attitude surveys
can often meet this requirement. Another reason to use a Web page survey is when you want
to show video or both sound and graphics. A Web page survey may be the only practical
way to have many people view and react to a video.
Part-II
DESIGNING QUESTIONNAIRE/SCHEDULE
Introduction
A questionnaire is widely used for data collection in survey research. It is a fairly reliable
tool for gathering data from large, diverse, varied and scattered groups. A questionnaire is a
list of questions sent to a number of persons for their answers, which yields standardised
results that can be tabulated and treated statistically.
Types of Questionnaire
Questionnaires may be broadly of two types, viz. structured and unstructured
questionnaires. According to P.V. Young, structured questionnaires are those “which pose
definite, concrete, and pre-determined questions, i.e.; they are prepared in advance and not
constructed on the spot during the question period”. Additional questions may be asked only
when some clarification is required. Answers to these questions are normally given with
high precision. For example, questions on age, sex, marital status, number of children,
nationality, etc., are automatically structured. Structured questionnaires may further be
grouped into closed-form and open-ended questionnaires. A closed-form questionnaire is one in
which questions are set in such a manner that they leave only a few alternative answers. The
informant is left with only a few choices to answer them. For example: do you think poverty
and unemployment have increased in India after economic reform? Yes/No/Can’t say.
In the question stated above, the respondent has to select one out of three alternatives. The
open-ended questionnaire, on the other hand, is one in which the respondent has full choice
of using his own style and diction of language expression, length and perception. He has
enough freedom while providing answers to open questions.
The unstructured questionnaire contains a set of questions which are not structured
in advance and which may be adjusted according to the need of question period. The
unstructured questionnaire is used mainly for conducting interviews. Flexibility is its chief
merit.
Let us take a hypothetical case: we want to identify the most important problem
facing the country. In an open-versus-closed experiment, people are asked what they think is the most
important problem facing the nation. In a closed-ended framework, we set five alternatives,
namely unemployment, economic disparity, crime, poor governance and inflation. One
open-ended question is also set. In response to the open-ended question, the respondents
may identify power shortages as the most vital problem of the country. Thus, open-ended
questions are also relevant, especially when the researcher has inadequate knowledge about
the various problems faced by the country.
Construction of Questionnaire
A. General Considerations
1. Most problems with questionnaire analysis can be traced back to the design phase
of the project. Well-defined goals are the best way to assure a good questionnaire
design. When the goals of a study can be expressed in a few clear and concise
sentences, the design of the questionnaire becomes considerably easier. The
questionnaire is developed to directly address the goals of the study.
2. One of the best ways to clarify your study goals is to decide how you intend to use
the information. This sounds obvious, but many researchers neglect this task.
3. Be sure to commit the study goals to writing. Whenever you are unsure of a
question, refer to the study goals and a solution will become clear. Ask only
questions that directly address the study goals.
4. KISS - keep it short and simple. If you present a 20-page questionnaire, most
potential respondents will give up in horror before even starting. One of the most
effective methods of maximizing response is to shorten the questionnaire.
5. If your survey runs over a few pages, try to eliminate questions. Many people have
difficulty knowing which questions could be eliminated. For the elimination round,
read each question and ask, "How am I going to use this information?" If the
information will be used in a decision-making process, then keep the question... it's
important. If not, throw it out.
6. Involve other experts and relevant decision-makers in the questionnaire design
process.
7. Formulate a plan for doing the statistical analysis during the design stage of the
project. Know how every question will be analyzed and be prepared to handle
missing data. If you cannot specify how you intend to analyze a question or use the
information, do not use it in the survey.
8. Provide a well written cover page. The respondent's next impression comes from
the cover letter (for mailed questionnaire). It provides your best chance to persuade
the respondent to complete the survey.
9. Give your questionnaire a title that is short and meaningful to the respondents. A
questionnaire with a title is generally perceived to be more credible than one
without.
10. Begin with a few non-threatening and interesting items. If the first items are too
threatening or "boring", there is little chance that the person will complete the
questionnaire.
11. Leave adequate space for respondents to make comments. Leaving space for
comments will provide valuable information not captured by the response
categories.
12. Place the most important items in the first half of the questionnaire. Respondents
often send back partially completed questionnaires.
13. Use professional production methods for the questionnaire—either desktop
publishing or typesetting and key-lining. Be creative.
14. The final test of a questionnaire is to try it on representatives of the target
audience.
B. Language
The wording of a question is extremely important. Researchers strive for objectivity
in surveys and, therefore, must be careful not to lead the respondent into giving a desired
answer. Many investigators have confirmed that slight changes in the way questions are
worded can have a significant impact on how people respond.
Because questionnaires are usually written by educated persons who have special
interest in and understanding of the topic of their investigation and because these people
usually consult with other educated and concerned persons, it is common for questionnaires
to be overwritten, overcomplicated, and too demanding of the respondent. Therefore, it
requires special measures to cast questions that are clear and straightforward in four
important aspects: simple language, common concepts, manageable tasks and widespread
information.
In choosing the language for a good questionnaire, the nature and structure of the
population to be studied should be kept in mind. Technical terms and jargon should be
avoided to the maximum possible extent. Words used in ordinary conversation should be
preferred. For example (formal word - everyday word to prefer):
Acquaint - inform
Assist - help
Consider - think
Reside - live
State - say
Sufficient - enough
Initiate - start and so on
C. Question Content
A questionnaire designer has to ensure that all the necessary items are duly
incorporated in the questionnaire. The investigator may take the help of standard checklists
to see that all the required items are included in the questionnaire. The checklist can also be
prepared by the investigator himself. Check lists may differ depending upon the aims and
objectives of the survey research. Some of the important items of checklist of content are
as follows:
1. Is this question necessary? Just how will it be used?
2. Are several questions needed on the subject matter of this one question?
3. Do the respondents have the information necessary to answer the questions?
4. Does the question need to be more concrete, more specific and closely related to
the respondent’s experience?
5. Is the question content sufficiently general and free from spurious concreteness
and specificity?
6. Is the question content biased or loaded in one direction – without accompanying
questions to balance the emphasis?
D. Question Types
Researchers use three basic types of questions: multiple choice, numeric open end and
text open end (sometimes called "verbatim"). Examples of each kind of question follow:
1. Where do you live? (1) Northern Region (2) Central Region (3) Eastern region
(4) Western region (5) Southern region
1. Excellent
2. Very good
3. Good
4. Fair
5. Poor
On a scale where “10” means you have a great amount of interest in a subject and
“1” means you have none at all, how would you rate your interest in each of the following
topics?
1. New economic policy
2. SEZ policy
3. Corporate social responsibility
4. Labour Market reforms
Clearly, there are many problems with this question. What if the respondent doesn't own
a microcomputer? What if he owns a different brand of computer? What if he owns both
an IBM PC and an Apple? There are two ways to correct this kind of problem.
The first way is to make each response a separate dichotomous item on the questionnaire.
For example:
Do you own an Apple computer? (circle: Yes or No)
Another way to correct the problem is to add the necessary response categories and allow
multiple responses. This is the preferable method because it provides more information
than the previous method.
4. Has mutually exclusive options. A good question leaves no ambiguity in the mind of the
respondent. There should be only one correct or appropriate choice for the respondent to
make.
Since almost all responses would be choice B, very little information is learned.
6. Does not presuppose a certain state of affairs. Among the most subtle mistakes in
questionnaire design are questions that make an unwarranted assumption. An example of
this type of mistake is:
Are you satisfied with your current auto insurance? (Yes or No)
This question will present a problem for someone who does not currently have auto
insurance. Write your questions so they apply to everyone.
One of the most common mistaken assumptions is that the respondent knows the correct
answer to the question. Industry surveys often contain very specific questions that the
respondent may not know the answer to. For example:
7. Does not imply a desired answer. The wording of a question is extremely important. As
examples:
8. Does not use emotionally loaded or vaguely defined words. Quantifying adjectives (e.g.,
most, least, majority) are frequently used in questions. It is important to understand that these
adjectives mean different things to different people.
E. Question Sequence
Items of a questionnaire should be grouped into logically coherent sections.
Grouping questions that are similar will make the questionnaire easier to complete, and the
respondent will feel more comfortable. Questions that use the same response formats, or
those that cover a specific topic, should appear together. Each question should follow
comfortably from the previous question. Writing a questionnaire is similar to writing
anything else. Transitions between questions should be smooth. Questionnaires that jump
from one unrelated topic to another feel disjointed and are not likely to produce high
response rates.
Some researchers have suggested that it may be necessary to present general
questions before specific ones in order to avoid response contamination. Other researchers
have reported that when specific questions were asked before general questions,
respondents tended to exhibit greater interest in the general questions.
The numbering of questions should be in a logical sequence. To check the sequence
of questions the following questions should be answered.
1. Are the answers to the questions likely to be influenced by the content of the
preceding questions?
2. Are the questions led up to in a natural way?
3. Do some questions come too early or too late from the point of view of arousing
interest and receiving sufficient attention, avoiding resistance and inhibitions?
15. Decide whether a personal or impersonal question will obtain the better response.
16. Questions should be limited to a single idea or a single reference.
1. The investigator must plan in advance and should fully know the problem under consideration. He
must choose a suitable time and place so that the respondent is at ease during the
interview.
2. All possible efforts should be made to establish proper rapport with the informant;
people are motivated to communicate when the atmosphere is favourable.
3. He must know that the ability to listen with understanding, respect and curiosity is the
gateway to communication, and must act accordingly during the survey.
4. Investigator’s approach must be friendly and informal. Initially friendly greetings in
accordance with the cultural pattern of the respondent should be exchanged and then
the purpose of the survey should be explained.
5. To the extent possible, there should be a free-flow interview and the questions
must be well phrased in order to have full cooperation of the respondent.
7. PARAMETRIC AND NON-PARAMETRIC STATISTICAL TESTS
Test of Significance:
The procedure to assess the significance of the difference between a sample statistic and the
corresponding population parameter, or of the difference between two independent statistics, is
called a test of significance.
Example: An agronomist wants to establish from research experiment data whether the average
yield of a new variety has some specific value or not.
The question is how to arrive at the conclusion whether a difference is real (significant) or due to
chance (non-significant), and how large a difference must be to be considered statistically
significant.
Alternate Hypothesis
1. H1: a ≠ b (two-tailed alternate)
2. H1: a > b (one-tailed; right-tailed alternate)
3. H1: a < b (one-tailed; left-tailed alternate)
Significance Level: The significance level is the probability with which the null hypothesis will be
rejected due to sampling error even though it is true.
Decision to reject or accept null hypothesis depends upon the information contained in the
sample and there is always a risk of taking wrong decision. One is likely to commit two
types of errors.
Type-I Error: The error of rejecting the null hypothesis on the basis of information
contained in the sample when it is actually true is called Type-I error. Its probability (the probability of
rejecting the null hypothesis when it is true) is denoted by α. (In quality control it is called
producer’s risk because it is the probability of rejecting a good lot.)
Type-II Error: The error of accepting the null hypothesis when it is false. Its probability is denoted by β.
It is also called consumer’s risk because it is the probability of accepting a bad lot.
Hypotheses H0 and H1 are mutually exclusive, i.e. if H0 is accepted (rejected) then
H1 is rejected (accepted).
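The two error probabilities can be worked out numerically for a simple case. The sketch below assumes a right-tailed test of a mean with known standard deviation and a single specific alternative value; all numbers (`mu0`, `mu1`, `sigma`, `n`) are hypothetical and chosen only to illustrate the calculation.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1 = 50.0, 52.0   # hypothetical means under H0 and under one specific H1
sigma, n = 5.0, 25      # assumed known population s.d. and sample size
alpha = 0.05            # chosen significance level

se = sigma / sqrt(n)    # standard error of the sample mean

# Critical value: reject H0 when the sample mean exceeds this point.
critical = NormalDist(mu0, se).inv_cdf(1 - alpha)

# Type-I error: probability of landing in the rejection region when H0 is true.
type1 = 1 - NormalDist(mu0, se).cdf(critical)

# Type-II error: probability of landing in the acceptance region when H1 is true.
beta = NormalDist(mu1, se).cdf(critical)

print(f"critical mean = {critical:.3f}")
print(f"P(Type-I error)  = {type1:.3f}")
print(f"P(Type-II error) = {beta:.3f}")
print(f"P(reject H0 | H1 true) = {1 - beta:.3f}")
```

Lowering the significance level moves the critical value to the right, which reduces the Type-I error probability but increases the Type-II error probability for a fixed sample size.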
Power of Test: The probability of rejecting the null hypothesis on the basis of sample
information when the null hypothesis is false is called the power of a test.
Therefore Power of Test = Prob. (Reject H0 when H0 is false) = 1 − Prob. (Type-II error)
Sample Space: Let the population size be N and the size of the random sample drawn be n; the number of possible samples is
$k = \binom{N}{n}$.
Suppose some statistic ‘t’ is computed from each of the samples:
$t = f(x_1, x_2, x_3, \ldots, x_n)$. The possible values of the statistic, $t_1, t_2, t_3, \ldots, t_k$, constitute the sample space.
The statistic is used to test the null hypothesis. Some values will lead to rejection of H0; others will lead to
acceptance of H0.
Thus sample space of statistic is divided into two disjoint and exhaustive sets.
Critical Region (W): It is the part of the sample space which leads to rejection of the null hypothesis
if the given sample statistic falls in this region.
Acceptance Region: It is that part of sample space, which leads to acceptance of null
hypothesis, if sample statistic falls in it.
Critical Point: The point in the sample space which divides it into two mutually
disjoint and exhaustive sets is known as the critical point.
The critical points are tabulated values for different sampling distributions. The form of the
sample space is determined by the sampling distribution used, such as t, F, χ², Z, etc.
Sampling Distribution: A sampling distribution is the probability distribution of a statistic.
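For a very small population the whole sample space can be listed, which makes the idea of a sampling distribution concrete. The sketch below assumes sampling without replacement and uses the sample mean as the statistic; the five-element `population` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import mean

population = [2, 4, 6, 8, 10]   # hypothetical population, N = 5
n = 2                           # sample size

# Every possible sample of size n drawn without replacement.
samples = list(combinations(population, n))
k = comb(len(population), n)    # number of possible samples

# The statistic t computed on each sample (here the sample mean).
sample_means = [mean(s) for s in samples]

# Sampling distribution: each distinct value of t with its probability.
counts = Counter(sample_means)
sampling_distribution = {t: c / k for t, c in sorted(counts.items())}

for t, prob in sampling_distribution.items():
    print(f"t = {t}: probability {prob:.1f}")
```

In this example the average of all the sample means equals the population mean, which is the usual sense in which the sample mean is an unbiased estimator.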
Two tailed and one tailed tests: A two tailed test rejects the null hypothesis if, say, the
sample mean is significantly higher or lower than the hypothesized value of the mean of the
population. Such a test is appropriate when the null hypothesis is some specified value and
the alternative hypothesis is a value not equal to the specified value of the null hypothesis.
One tailed test: A one tailed test would be used when we are to test, say, whether the
population mean is either lower than or higher than some hypothesized value.
For example, if $H_0: \mu = \mu_0$ and
$H_1: \mu < \mu_0$, then we are interested in what is known as a left tailed test; if
$H_1: \mu > \mu_0$, then it is a one tailed test known as a right tailed test. (There is only
one rejection region, either on the left tail or on the right tail.)
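The difference between the two kinds of test shows up in the critical points. The sketch below assumes the test statistic follows the standard normal distribution; the function names `critical_values` and `decide` are illustrative and not from any particular package.

```python
from statistics import NormalDist

def critical_values(alpha):
    """Return standard normal critical points for the three forms of H1."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return {
        "two": z.inv_cdf(1 - alpha / 2),   # reject when |Z| exceeds this
        "right": z.inv_cdf(1 - alpha),     # reject when Z exceeds this
        "left": z.inv_cdf(alpha),          # reject when Z falls below this
    }

def decide(z_stat, tail, alpha=0.05):
    """Compare a computed Z statistic with the relevant critical point."""
    c = critical_values(alpha)[tail]
    if tail == "two":
        reject = abs(z_stat) > c
    elif tail == "right":
        reject = z_stat > c
    else:
        reject = z_stat < c
    return "reject H0" if reject else "accept H0"

cv = critical_values(0.05)
print({name: round(value, 3) for name, value in cv.items()})
print(decide(1.8, "two"))    # same statistic, two-tailed alternative
print(decide(1.8, "right"))  # same statistic, right-tailed alternative
```

The same statistic can lead to different decisions depending on the form of H1, so the alternative hypothesis should be fixed before the data are examined.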
Tests of Hypotheses:
Tests of hypotheses (also known as tests of significance) can be classified as:
1. Parametric Tests
2. Non-parametric Tests
Parametric Tests: Parametric tests usually assume certain properties of the parent
population from which sample is drawn. Assumptions like observations come from a normal
population, sample size is large, assumptions about the population parameters like mean,
variance etc. must hold good before parametric tests can be used. Probability distribution of
statistic (sampling distribution) is known i.e. it follows particular distribution like t, F, Z
etc. Parametric tests cannot be applied if the nature of the parent population is unknown or the data are
measured on a nominal/ordinal scale.
The important parametric tests are: z-test, t-test, χ²-test and F-test. (The χ²-test is also used as
a test of goodness of fit and as a test of independence, in which case it is a non-parametric test.)
All these tests are based on the assumption of normality, i.e. the source of data is considered
to be normally distributed.
Z-test: it is based on the normal probability distribution and is used for judging the
significance of several statistical measures, particularly the mean. The relevant test statistic,
Z, is worked out and compared with its probable value at a specified level of significance for
judging the significance of the measure concerned.
As n becomes large Z-test is generally used even when binomial distribution or t-
distribution is applicable on the presumption that such a distribution tends to approximate
normal distribution.
The Z-test is used for comparing a sample mean with a hypothesised population mean when the population variance is known; for
judging the significance of the difference between the means of two independent samples when the population
variance is known; for comparing a sample proportion with a theoretical value of the population proportion;
and for judging the difference in proportions of two independent samples when n happens to
be large. This test may also be used for judging the significance of the median, mode, coefficient of
correlation and several other measures.
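As an illustration of the first use, the sketch below computes the Z statistic and a two-tailed p-value for a sample mean when the population standard deviation is assumed known. The function name `z_test_mean` and the yield figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(sample_mean, mu0, sigma, n):
    """Z statistic and two-tailed p-value for H0: population mean = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: mean yield of 52 units from 36 plots, tested against
# a claimed mean of 50 with a known population s.d. of 6.
z, p = z_test_mean(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
print(f"Z = {z:.2f}, two-tailed p-value = {p:.4f}")
```

If the p-value is smaller than the chosen level of significance, the null hypothesis is rejected; otherwise it is accepted.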
t-test: The t-test is based on the t-distribution and is considered an appropriate test for judging the
significance of a sample mean or of the difference between the means of two samples in the case of
small sample(s) when the population variance is not known (the sample variance is then used as an estimate of the population
variance). When two samples are related, we use the paired t-test (difference test) for judging
the significance of the mean of the differences between the two related samples. It is also used for testing
the significance of the coefficients of simple and partial correlation. The relevant test statistic
t is calculated from the sample data and then compared with its tabulated value from the
t-distribution at the chosen level of significance and degrees of freedom for
accepting or rejecting the hypothesis.
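The calculation of the t statistic can be sketched as follows for the one-sample and paired cases. The function names and the before/after scores are hypothetical; the critical value 2.776 is the tabulated two-tailed 5% point of the t-distribution with 4 degrees of freedom, entered by hand because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for H0: population mean = mu0."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))  # stdev uses n - 1
    return t, n - 1

def paired_t(before, after):
    """Paired t-test: a one-sample t-test on the differences, with H0: mean difference = 0."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0.0)

# Hypothetical scores for the same five subjects before and after a treatment.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]

t, df = paired_t(before, after)
t_critical = 2.776  # tabulated value: two-tailed, 5% level, 4 d.f.
print(f"t = {t:.3f} with {df} d.f.")
print("reject H0" if abs(t) > t_critical else "accept H0")
```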
χ²-test: It is based on the chi-square distribution and, as a parametric test, is used for comparing
a sample variance to a theoretical population variance.
$\chi^2 = \sum (X_i - \bar{X})^2 / \sigma^2 = (n-1) S^2 / \sigma^2$ with n-1 d.f.
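A minimal sketch of this statistic, assuming a hypothesised population variance `sigma0_sq`: the sample values are invented, and 9.488 is the tabulated upper 5% point of the chi-square distribution with 4 degrees of freedom, entered by hand.

```python
from statistics import variance

def chi_square_variance(sample, sigma0_sq):
    """Chi-square statistic (n - 1) * S^2 / sigma0^2 and its degrees of freedom."""
    n = len(sample)
    s_sq = variance(sample)          # sample variance with divisor n - 1
    return (n - 1) * s_sq / sigma0_sq, n - 1

sample = [4, 6, 8, 10, 12]           # hypothetical observations
chi_sq, df = chi_square_variance(sample, sigma0_sq=8)

chi_critical = 9.488                 # tabulated value: upper 5% point, 4 d.f.
print(f"chi-square = {chi_sq:.2f} with {df} d.f.")
print("reject H0" if chi_sq > chi_critical else "accept H0")
```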
F-test: The F-test is based on the F-distribution and is used to compare the variances of two
independent samples. This test is also used in the context of analysis of variance (ANOVA)
for judging the significance of more than two sample means at one and the same time. It is
also used for judging the significance of multiple correlation coefficients. Test statistic, F,
is calculated and compared with its probable value for accepting or rejecting the null
hypothesis. (we use F-ratio Table for certain d.f. at certain level of significance)
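The variance-ratio form of the test can be sketched as below. Putting the larger variance in the numerator is a common convention assumed here, and the two samples are hypothetical. The resulting F and its degrees of freedom would then be compared with the F-ratio table.

```python
from statistics import variance

def f_ratio(sample_a, sample_b):
    """Return F (larger variance over smaller) with numerator and denominator d.f."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Hypothetical measurements from two independent samples.
sample_a = [4, 6, 8, 10, 12]   # sample variance 10
sample_b = [7, 8, 8, 9, 8]     # sample variance 0.5

f, df_num, df_den = f_ratio(sample_a, sample_b)
print(f"F = {f:.1f} with ({df_num}, {df_den}) d.f.")
```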
Non-Parametric Tests: Tests that are used when the data may be non-normal
and/or it may not be possible to estimate the parameter(s) of the population are called
non-parametric tests. Since these tests make no assumption about the form of the distribution or its
parameters, they are also called distribution-free tests. Non-parametric
tests can be used for nominal data (qualitative data, like greater or less etc.) and
ordinal data, like ranked data. These tests require less calculation, because there is no need
to compute parameters. Also these tests can be applied to very small samples, more
specifically during pilot studies in market research. Inference about the population can be
made by the non-parametric tests when assumptions of the standard methods cannot be
satisfied since the non-parametric tests involve no or less restrictive assumptions when
compared to the parametric tests.
Main non-parametric tests are:
1. One-sample tests
   a. One-sample sign test
   b. Chi-square test
   c. Kolmogorov-Smirnov test
   d. Run test for randomness
2. Two-sample tests
   a. Two-sample sign test
   b. Median test
   c. Mann-Whitney U test (Rank sum test)
3. K-sample tests
   a. Median test
   b. Kruskal-Wallis test (H test)
   c. Kendall’s coefficient of concordance test
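As an illustration of the first test in the list, the sketch below carries out a one-sample sign test using the exact binomial distribution. Dropping observations equal to the hypothesised median and doubling the smaller tail are common conventions assumed here; the data are hypothetical.

```python
from math import comb

def sign_test(sample, median0):
    """One-sample sign test of H0: population median = median0 (two-tailed)."""
    plus = sum(1 for x in sample if x > median0)
    minus = sum(1 for x in sample if x < median0)
    n = plus + minus                 # observations equal to median0 are dropped
    k = min(plus, minus)
    # Under H0 the number of plus signs is Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return plus, minus, p_value

# Hypothetical sample tested against a hypothesised median of 12.
sample = [12, 15, 11, 18, 20, 14, 16, 19, 17, 10]
plus, minus, p = sign_test(sample, median0=12)
print(f"plus = {plus}, minus = {minus}, p-value = {p:.4f}")
```

Only the signs of the deviations are used, so the test needs no assumption about the shape of the population distribution.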
8. HOW TO WRITE RESEARCH PROPOSAL
Two main purposes: (i) to get a degree and (ii) to conduct sponsored and consultancy
research projects.
With increasing privatization of higher education and shrinking public grants,
academic institutions place greater stress on generating their own resources.
Academic institutions need faculty capable of doing independent sponsored and
consultancy research projects—lead the research team
A good research proposal (RP) is not only necessary for a high quality of research
but also for getting grant from the funding agencies
A RP must be convincing to anonymous experts who examine it and see whether it
is methodologically sound, conceptually clear and would make significant
contribution to the knowledge on the subject.
As a large number of RPs are submitted to the funding agencies for financial assistance, your
proposal needs to be excellent, and not just very good, to be approved for the
grant.
A good RP presumes that you have already thought about your project and have
devoted some time and efforts in gathering information, reading and organizing
your thoughts
A high quality proposal not only promises success for the project, but also
impresses RDC about your potential as a researcher.
TYPES OF PROPOSALS
Letter proposal
– Preliminary expression of interest to an investor
– Developed into full proposal only with donor’s consent
– Most unsolicited proposals should first be in form of letter proposal
Full proposal
– Often in response to “Request For Proposal” - RFP
FEATURES OF A GOOD RP
Parsimony: simple, unambiguous, no jargon, able to convey what you want to do
and how you want to do it.
The most important ideas are highlighted.
Consistency among objectives, hypotheses and title
Contains Executive summary
A detailed schedule of activities
Collaboration, if any, clearly stated.
Follows all of the directions given in the proposal guidelines.
Appendices for detailed and lengthy materials
The length consistent with the guidelines of funding agency
The budget and the proposal narrative are consistent.
The uses of fund are clearly indicated.
The qualifications and experience and credentials of PI and Co-PI mentioned
Identify broad area and then narrow down—follow general to particular approach
Study broadly (books, journals and reports): the more you read, the more likely
you are to encounter a topic that interests you
Think when you read; most ideas come unexpectedly while you are
reading. Think and think beyond
Be inclusive with your thinking: do not try to eliminate ideas too quickly. Build on
your ideas and see how many different research topics you can identify. Be
expansive in your thinking at this stage; you will not be able to do so later.
Write down your ideas: whenever you have a good idea, no matter how small and
how immature it may be, write it down and save it. Later on, when you check your
‘idea box’, you will be surprised to find how many brilliant ideas you already have.
Develop a topic that has interested you throughout your graduate or undergraduate
career
Think about the top three issues you want to study, then turn them into questions
Look at class notes; your teachers may have pointed out potential research topics
or commented on unanswered questions in the field
Talk with professors or advisors about possible topics
Study broadly to identify gaps in the literature
Get feedback on a potential topic from your advisor
Do research to discover why your topic has not been studied before.
WRITING THE PROPOSAL (TITLE)
The function of title is to encapsulate in a few words the essence of the research
Should have some key words reflecting variables, theoretical basis, and purposes,
time, place, etc.
Leave out phrases like “an investigation into”, “a study of”, “aspects of” as these
are obvious attributes of a research project
THE BACKGROUND
It must be clear from the text what the nature of the problem is, how it was
identified and why it is significant.
A problem statement should be presented within a context and the context should
be briefly explained, including a discussion on the conceptual framework.
The problem might be defined as the issue that exists in the literature, theory or
practice that leads to a need for the study.
The problem should be clearly defined, making the evaluation easy for the
reviewers/RDC members.
Effective problem statements answer the question: Why does this research need to
be conducted?
REVIEW OF LITERATURE
It provides the background and context for the research problem.
It shares with the readers the results of other studies that are closely related to the
proposed study.
It relates the proposed study to the ongoing dialogue in the literature, filling in
gaps.
It provides a framework for establishing the importance of the study, as well as a
benchmark for comparing the results of the study with other findings.
Demonstrate to the reader that you have a comprehensive knowledge of the field.
Help to avoid statements that imply little has been done in the area.
Topic or problem area: This part of the literature review covers material directly
related to the problem being studied. separate substantive areas.
Theory area: Investigators must identify the theory which relates to the problem
area.
Purpose of the study should be clearly stated. If it is not clear to the writer, it
cannot be clear to the reader.
Briefly define the specific area of the research.
Foreshadow the hypotheses to be tested or the questions to be raised as well as
significance of the study. These will require specific elaboration in separate
sections.
Should incorporate rationale for the study.
Key points:
Start with “the purpose of the study is---”.
Clearly identify and define the central ideas of the study
Identify the specific method of inquiry to be used.
OBJECTIVES
Typically comes after problem definition, motivation and significance.
Start with overall objective (or aim of the research), and then state two to four
specific objectives
Write a list of the objectives of your research. Think of as many as you can. When you have
done it, consider each one carefully by asking the questions:
How will the objective be achieved—methods, resources, skills, time?
Is it realistically possible to achieve it?
What results are required to achieve it?
Is the objective central to your study?
Are there any overlaps between the objectives?
Is there any sequence or hierarchy that links one to another? If so, are they in the
correct order?
Are there too many objectives to be realistically achievable?
HYPOTHESES
State what kind of relationships you expect to find between variables or factors.
A hypothesis is particularly necessary in the search for cause-and-effect relationships.
It is your intelligent guess about the possible relationship. Do not strain to prove
that your guess is right. It is more common to disprove than to prove a hypothesis.
A good hypothesis should possess the following features:
must be conceptually clear
should have empirical reference
must be specific
should be related to available techniques
Should be related to a body of theory
The RP must specify the research operations you will undertake and the way you
will interpret the results of these operations in terms of your central problem.
Do not just tell what you mean to achieve, tell how you will achieve.
A methodology is not just a list of research tasks but an argument as to why these
tasks add up to the best attack on the problem.
Indicates the methodological steps you will take to answer every question or test
every hypothesis.
The variables you propose to control and how—experimentally or statistically
Sampling design, data collection, data analysis techniques, Instruments, etc.
RELEVANCE/SIGNIFICANCE
Indicate how your research will refine, revise, or extend existing
knowledge.
How will results of the study be implemented, and what innovations will
come about?
POSSIBLE OUTCOME
Problems from reviewers’ point of view
Problem: They may not get the significance of your proposed research.
Solution: Write a compelling argument.
Problem: They may not be familiar with all your methods.
Solution: Write to the non-expert in the field.
Problem: They may not be familiar with your lab.
Solution: Show them you can do the job.
Problem: They may get worn out by having to read 10 to 15 applications in detail.
Solution: Write clearly and concisely, and make sure your application is neat, well
organized, and visually appealing. Leave out anything that is not absolutely
critical.
Does the title of the RP exactly identify and delineate the area of investigation?
Does the proposal give adequate reasons to show that the study will contribute: to
knowledge; development of theory in the subject; or to either theoretical or
practical methodology?
Does the proposed RP show: how the research will be structured; how the research
will be phased and carried out; the techniques and methods to be used and the reasons for using
them; and that the proposed work is practicable and can commence within the time frame?
CONVINCE THE REVIEWERS
You need to convince the selection committee that:
you have a good chance of attaining the goals with the resources available.
Every proposal reader constantly scans for clear answers to three questions:
What are we going to learn as the result of the proposed project that we do not
know now?
Why is it worth knowing?
How will we know that the conclusions are valid?
The opening paragraph, or the first page at most, is your chance to grab the attention.
This is the moment to overstate, rather than understate, your point or question.
Questions that are clearly posed are an excellent way to begin a proposal
Most proposals are reviewed by multidisciplinary committees.
BAD: All previous studies are worthless because they failed to recognize the effect of X
on Y. Chen and Smith (1998) tried but their approach was simply wrong. Ours is the first
study to address this question correctly.
the effect of X on Y. A pioneering effort in this direction is described by Chen and Smith
(1998),
9. REFINING SKILLS IN BASIC STATISTICAL ANALYSIS
Objective
To give participants greater confidence in data analysis using statistical software.
Learning outcomes
On completion of the lecture, participant would be able to understand the following:
Beneficiaries:
Faculty from engineering, management, pure sciences, and social sciences who have been
actively engaging in guiding research scholars and carrying out sponsored research projects.
A basic understanding of statistics will be useful, but is certainly not essential.
Regression Analysis
It is a statistical method for studying the relationship between a single dependent variable
and one or more independent variables, with a view to estimating and/or predicting the
(population) mean or average value of the former in terms of known or fixed values of the
latter. When independent variable is only one, it is called bivariate analysis and when
independent variables are more than one, it is known as multivariate analysis.
Variables
Dependent      Independent
Explained      Explanatory
Predictand     Predictor
Regressand     Regressor
Endogenous     Exogenous
Deterministic
Stochastic/ Probabilistic
If we believe that the model should be constructed to allow for random error, then we
hypothesize a probabilistic model. This includes both a deterministic component and a
random error component.
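$$Y = a + bX \quad \text{(deterministic)}$$
$$Y = a + bX + u \quad \text{(stochastic)}$$
The fitted line is $\hat{Y} = a + bX$, and the residual is $e = Y - \hat{Y}$. The least squares method chooses a and b to minimise the sum of squared residuals:
$$\sum e^2 = \sum (Y - \hat{Y})^2 = \sum (Y - a - bX)^2$$
$$\frac{\partial \sum e^2}{\partial a} = -2 \sum (Y - a - bX) = 0$$
$$\frac{\partial \sum e^2}{\partial b} = -2 \sum X (Y - a - bX) = 0$$
From the first condition, $\sum (Y - a - bX) = 0$, i.e. $\sum Y - na - b \sum X = 0$. The two conditions therefore give the normal equations:
$$\sum Y = na + b \sum X$$
$$\sum XY = a \sum X + b \sum X^2$$
By solving these two normal equations, the intercept and slope of the linear regression line are
estimated. Values of a and b can be estimated directly by the following formulae:
$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}, \qquad a = \bar{Y} - b\bar{X}$$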
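The manual calculation for the bivariate case can be sketched in a few lines. The function `fit_line` below solves the two normal equations above for a and b using the closed-form solution for the slope; the function name and both data sets are hypothetical and serve only as an illustration.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for the line Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Slope from solving the two normal equations simultaneously.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept: the fitted line passes through the point of means.
    a = sum_y / n - b * sum_x / n
    return a, b

# Data lying exactly on Y = 2 + 3X: the fit recovers the line.
a_exact, b_exact = fit_line([1, 2, 3, 4, 5], [5, 8, 11, 14, 17])

# Data with scatter: the residuals sum to zero, as the first normal equation requires.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"exact data:     a = {a_exact:.2f}, b = {b_exact:.2f}")
print(f"scattered data: a = {a:.2f}, b = {b:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

With more than one independent variable the same idea leads to a system of several normal equations, which is why statistical packages are normally used.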
Bivariate regression can be done manually. When the number of independent variables is
more than one, it becomes cumbersome to estimate coefficients manually. For this, special
computer packages are used to run the regression programme.
Assumptions
For each set of values for the K independent variables $(X_{1j}, X_{2j}, \ldots, X_{Kj})$, $E(u_j) = 0$ (i.e., the mean of the
error term is zero).
For each set of values for the K independent variables, $\mathrm{VAR}(u_j) = \sigma^2$ (i.e., the variance of the
error term is constant) (homoscedasticity). Violation of this assumption creates the
problem of heteroscedasticity.
For any two sets of values for the K independent variables, $\mathrm{COV}(u_i, u_j) = 0$ (i.e., the
error terms are uncorrelated). Violation of this assumption creates the autocorrelation problem.
For each $X_i$, $\mathrm{COV}(X_i, u) = 0$ (i.e., each independent variable is uncorrelated with the
error term). Violation of this assumption creates the endogeneity problem.
For each set of values for the K independent variables, $u_j$ is normally distributed.
There are two uses of regression: prediction and causal analysis. In a prediction
study, the aim is to develop a formula for making predictions about the dependent
variable, based on the observed values of the exogenous variables. For example, an
economist may want to predict next year’s GNP based on such variables as last year’s
GNP, current interest rates, the current rate of investment and other variables.
Regression does two things: for prediction studies, it makes it possible to combine many
variables to produce optimal predictions of the endogenous variable, and for causal
analysis, it separates the effects of the exogenous variables on the endogenous variable.
Sophisticated non-linear regression models are very complicated and require a high level
of mathematical skill and specialized software.
In a bivariate analysis, the relationship between the two variables can easily be
identified by plotting the data on a graph. If the points lie along a straight line, a linear function is
used. It is difficult to judge linearity when the number of
exogenous variables is large. If the real relationship is non-linear and a linear function
is used, the analysis may give misleading results. A useful general principle in science
is that when you do not know the true form of a relationship, start with something simple.
A linear equation is perhaps the simplest way to describe a relationship between two or
more variables and still get reasonably accurate
predictions. Furthermore, it is easy to modify the linear equation to represent certain
kinds of non-linearity.
The most common statistic for measuring how well the estimated equation fits the data is the
coefficient of determination (R²). The basic idea behind R² is to compare two quantities:
The sum of squared errors produced by the least squares equation and
The sum of squared errors for a least squares equation with no independent variables
(just the intercept).
When an equation has no independent variables, the least squares estimate for the intercept
is just the mean of the dependent variable.
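R² = 1 − {∑(Y − Ŷ)² / ∑(Y − Ȳ)²}
F = [ESS/(k − 1)] / [RSS/(n − k)] = [ESS/(k − 1)] × [(n − k)/RSS] = ESS(n − k) / [RSS(k − 1)]
F = (n − k)(ESS/TSS) / [(k − 1)(RSS/TSS)] = (n − k)R² / {(k − 1)[(TSS − ESS)/TSS]} = (n − k)R² / [(k − 1)(1 − R²)]
Here ESS is the explained sum of squares, RSS the residual sum of squares, TSS = ESS + RSS the
total sum of squares, n the number of observations and k the number of estimated parameters
(including the intercept).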
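The following is a minimal sketch of how R² and F can be computed from observed and fitted values. The function name and the numbers are hypothetical. It assumes the fitted values come from a least squares line with an intercept, so that ESS = TSS − RSS holds.

```python
def r_squared_and_f(ys, fitted, k):
    """Return (R², F); k counts all estimated parameters, intercept included."""
    n = len(ys)
    mean_y = sum(ys) / n
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual sum of squares
    tss = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
    ess = tss - rss                                      # explained sum of squares
    r2 = 1 - rss / tss
    f_stat = (ess / (k - 1)) / (rss / (n - k))
    return r2, f_stat


# Hypothetical data: fitted values from the line Ŷ = 2.2 + 0.6X with X = 1..5.
Y = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, f_stat = r_squared_and_f(Y, fitted, k=2)
print(f"R2 = {r2:.2f}, F = {f_stat:.2f}")
```

The same F value is obtained from (n − k)R² / [(k − 1)(1 − R²)], which is a convenient cross-check.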
In any regression analysis, we typically want to know something about the accuracy
of the numbers we get when we calculate estimates of regression coefficients. There are three
possible sources of error:
Measurement error: very few variables can be measured with perfect accuracy,
especially in social sciences.
Sampling error: In many cases, our data are only a sample from larger population
and the sample will never be exactly like the population.
Uncontrolled variation: there may be many other variables that are not under
the researcher’s control. They can disturb the relationship between the dependent and independent
variables included in the function.
The basic assumption is that the errors occur in a random and unsystematic fashion.
We evaluate the extent and importance of this random variation by calculating confidence
intervals or hypothesis tests.
Confidence intervals give us a range of possible values for the coefficients. Although
we may not be certain that the true value falls in the calculated range, we can be reasonably
confident. Hypothesis tests are used to answer the question of whether or not the true
coefficient is zero.
For instance, if b is 600 and SE is 210, then the approximate 95% confidence interval runs from
600 − (2 × 210) = 180 to 600 + (2 × 210) = 1020. We can say that we are 95% confident that the true
coefficient lies somewhere between 180 and 1020.
t-statistic = b / SE
Then we consult a t table (or the computer does this for us) to find the associated p value. If
the p value is small, it is taken as evidence that the coefficient is not zero.
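The worked example with b = 600 and SE = 210 can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the p value uses the standard normal distribution as a large-sample stand-in for the t table (an assumption; with a small sample the t distribution with n − k degrees of freedom should be used).

```python
import math


def coefficient_summary(b, se, multiplier=2.0):
    """Return (lower, upper, t_stat, p_value) for a regression coefficient."""
    lower = b - multiplier * se
    upper = b + multiplier * se
    t_stat = b / se
    # Two-sided p value under the normal approximation (assumption, see lead-in).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return lower, upper, t_stat, p_value


lower, upper, t_stat, p_value = coefficient_summary(600, 210)
print(f"interval = ({lower:.0f}, {upper:.0f}), t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because the interval excludes zero and the p value is small, the example coefficient would be judged significantly different from zero.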
It is analogous to the decision reached in a court of law. Under the court system, a defendant is
brought to trial and is assumed to be not guilty. For the judge or jury to reject the assumption
of non-guilt, sufficient evidence must be produced. In the court system, errors can be made:
an innocent defendant can be found guilty and a guilty individual can be found not guilty. Under
a legal system where the evidence must show beyond a shadow of doubt that the assumption
of non-guilt is to be rejected, there is a primary concern for the error of the first
type, i.e., of convicting an innocent person. Just as a defendant is assumed not guilty until
proven guilty, in hypothesis testing, the null hypothesis is assumed true until there is
sufficient evidence that it is not true.
There are different ways of interpreting results for different types of variables.
As discussed, there are three types of variables: interval scales, ordinal scales, and nominal
scales.
The following results indicate that wrongly selected data and variables may provide
misleading results. State-wise data for the year 2000-01 were collected by some researchers to
assess the impact of technical education on economic development. The results are
as follows:
Where:
PCI = per capita net state domestic products
GE = general education
Multicollinearity is undesirable.
It arises from high correlation among the independent
variables.
It comes in two forms: extreme and near-extreme. Extreme multicollinearity means
that at least two of the independent variables are perfectly correlated. The computer
easily detects this type of problem. Near-extreme multicollinearity means simply
that there are strong linear relationships among the exogenous variables. In the
presence of this problem, regression coefficients tend to have larger SEs than they
would have in its absence.
This problem can be identified by: (1) estimating the correlation among the variables: if
the value of r is 0.8 or above, there is a severe problem; (2) fitting a regression between
the two variables: if the R² is 0.60 or above, there is a problem; (3) the tolerance value,
i.e., 1 − R²: if it is below 0.40, there is a problem; and finally (4) the variance inflation factor
(VIF), i.e., 1/tolerance. The square root of the VIF tells us how much larger the SE is,
compared with what it would be if that variable were uncorrelated with the other
variables.
Time series data, panel data and aggregate data are more prone to the problem. There
are various solutions: deletion of one or more variables from the model, combining
the collinear variables into an index, and performing joint hypothesis tests.
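The four diagnostics can be sketched for the simplest case of two explanatory variables, where the R² from regressing one on the other equals the squared correlation. The function name and data are hypothetical; with more than two variables, R² would come from regressing each variable on all the others.

```python
import math


def collinearity_diagnostics(x1, x2):
    """Correlation, R², tolerance and VIF for a pair of explanatory variables."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    ss1 = sum((a - m1) ** 2 for a in x1)
    ss2 = sum((b - m2) ** 2 for b in x2)
    r = cov / math.sqrt(ss1 * ss2)
    r2 = r * r                    # R² of the auxiliary regression (two-variable case)
    tolerance = 1 - r2            # undefined VIF if the variables are perfectly correlated
    vif = 1 / tolerance
    return {"r": r, "r2": r2, "tolerance": tolerance,
            "vif": vif, "se_inflation": math.sqrt(vif)}


# Hypothetical explanatory variables, invented for illustration.
d = collinearity_diagnostics([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(d)
```

In this made-up example r = 0.8, so the tolerance is 0.36 and the SE is about 1.67 times what it would be without the correlation.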
7. Heteroscedasticity problem
The word homoscedasticity is derived from Greek words meaning ‘same variance’.
Its opposite is heteroscedasticity, which means that the degree of random noise in the
equation varies with the values of the x-variables. It can be checked by plotting the
data on a graph. This problem has two effects:
Inefficiency: in the presence of this problem, OLS is not optimal as it gives
equal weight to all observations.
Biased SE: in its presence, SE estimates
can be seriously biased. That in turn leads to bias in test statistics and
confidence intervals.
Solutions: weighted least squares (WLS) and transforming the data.
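To show what WLS changes relative to OLS, here is a minimal bivariate sketch in which each observation gets its own weight. The function name is hypothetical, and the choice of weights (1/X², i.e., assuming the error standard deviation is proportional to X) is only an illustrative assumption; in practice the weights depend on how the error variance actually behaves.

```python
def weighted_fit(xs, ys, weights):
    """Minimise the weighted sum of squares Σ w (Y − a − bX)²."""
    sw = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / sw   # weighted mean of X
    y_bar = sum(w * y for w, y in zip(weights, ys)) / sw   # weighted mean of Y
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data; weights assume error variance proportional to X² (assumption).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
w = [1 / x ** 2 for x in X]
a_wls, b_wls = weighted_fit(X, Y, w)
a_ols, b_ols = weighted_fit(X, Y, [1] * len(X))  # equal weights reproduce OLS
print(a_wls, b_wls, a_ols, b_ols)
```

With equal weights the function gives the ordinary least squares line, which makes the role of the weights easy to see.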
8. Auto-correlation problem
Auto-correlation or serial correlation refers to the case in which the residual error
terms from different observations are correlated. It can be caused by several factors,
including omission of an important explanatory variable or the use of an incorrect
functional form. Whatever the cause may be, it influences the outcome of the
hypothesis testing. Its effect is to underestimate the SE of the coefficients. This in turn
yields inflated t-ratios, which means that it is possible that coefficients will be
found to be significantly different from zero when in fact they are not.
This problem can be diagnosed by the Durbin-Watson test (DW test). In the presence
of autocorrelation, OLS is not efficient; GLS is preferred.
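The DW statistic itself is simple to compute from the residuals: d = ∑(e_t − e_{t−1})² / ∑e_t². A value near 2 suggests no first-order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation. The sketch below uses made-up residuals; judging significance still requires the DW tables or a package.

```python
def durbin_watson(residuals):
    """Durbin-Watson d statistic for residuals ordered in time."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residual series, invented for illustration.
d_positive = durbin_watson([1, 1, 1, -1, -1, -1])  # long runs of the same sign
d_negative = durbin_watson([1, -1, 1, -1])         # sign flips every period
print(d_positive, d_negative)
```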
How do we run a regression?
How do we choose a computer package?
How do we get our data into the computer package?
What else should we do before running the regression?
How do we indicate which regression model to run?
How do we interpret computer output?
What are the common options in regression packages?
What are standardized coefficients and how are they interpreted?
10. Statistical Software: SPSS
Introduction
In this lecture, we shall attempt to make you aware of the main features of SPSS and to
improve your skill in applying the software to analyse data and
draw meaningful results. As SPSS is a very comprehensive and flexible software package,
covering almost all aspects of data processing, cleaning, tabulating, analyzing and
reporting, it would not be feasible to discuss all these aspects in one session. Therefore,
the discussion will be limited to some of the most relevant features of the software.
SPSS (Statistical Package for the Social Sciences) can take data from almost any
type of file and use them to generate tabulated reports, charts, plots of distributions
and trends, and descriptive statistics, and to conduct complex statistical analyses. Our institute
has a site license for this software that allows researchers to use it in any general
access lab on campus. A brief note about the software is given in this write-up. Detailed
discussions, along with some practical examples, will take place during the interactive
session on SPSS.
When you start SPSS for Windows, the first thing you will see is the data window. The
data window has a spreadsheet akin to an Excel spreadsheet. You can enter data directly into
it. Cases (observations) are recorded in rows and variables in columns. You can cut, paste,
and delete rows and columns as required.
The output window displays the output from statistical analyses and any charts you have
run. A table can be edited by double-clicking on the section of the table that you would
like to edit. Furthermore, tables can be opened as pivot tables and edited from the pivot table
window so that one may adjust the look of the table.
There are two approaches to working in SPSS: using a point-and-click approach, and using
SPSS syntax to program commands and routines. The syntax window is used when data are
to be extracted from large databases. CSO and NSSO unit-level data are extracted through
the syntax window. It is also a very useful record-keeping tool. When using the point-and-click
approach, all commands and procedures are stored "in the background." This information
can then be pasted to the syntax window, using the "Paste" option found in the GUI
(i.e., the point-and-click approach). Having this information saved in the syntax window can
save a researcher abundant grief if printouts of output are lost. In addition, you can place
comments in the syntax window to indicate what it is that you are doing.
The Syntax Editor window can be opened by doing the following: under File, select New, and
then Syntax. Keep in mind the following rules when writing command syntax:
• Each command must begin on a new line and end with a period (.).
• Most subcommands are separated by slashes (/). The slash before the first
subcommand on a command is usually optional.
• Variable names must be spelled out fully.
• Text included within apostrophes or quotation marks must be contained on a single
line.
• Each line of command syntax cannot exceed 80 characters.
• A period (.) must be used to indicate decimals, regardless of your Windows regional
settings.
• Variable names ending in a period can cause errors in commands created by the
dialog boxes. You cannot create such variable names in the dialog boxes, and you
should generally avoid them.
There are several default options in SPSS that you may find useful to change. You can edit
these options by going to the Edit menu and selecting Options. A dialog box will appear.
There are three main ways to get data into SPSS: (a) creating a new SPSS data file, (b)
opening existing SPSS data files, and (c) importing data from another source such as an
ASCII file, an Excel spreadsheet, etc.
Data can be directly entered into SPSS similar to an Excel spreadsheet. You may also cut
and paste data from other applications into SPSS. However, if you are going to enter data
directly, you will need to name and define your variables.
Opening existing SPSS files is a simple procedure, similar to opening other Windows files.
Select "Open" from the File menu, and you will find a dialog box.
In order to use the data in SPSS, the data must be converted to a file format that SPSS can
recognize, namely something in *.sav format. SPSS can read in ASCII data, which can then
be saved in *.sav format.
SPSS allows the user to open data directly into SPSS from many different file formats.
For example, SPSS will directly open Excel, SAS, Lotus, and *.dbf (database) files. All
the user needs to do is to go to the File Menu, select "Open", "Data", select the correct file
type from the "Files of Type" drop down menu, and navigate to the file you wish to open.
6. Saving Data in SPSS
Saving data in SPSS is very similar to other Windows applications. Select "Save as" from
the File menu, move to the directory in which you want to save the file, and give the file any
name you desire. SPSS allows you to give your files descriptive names, without an
eight-character restriction. The default file type in which the data file will be saved is *.sav.
If you wish to save it as another file type (e.g., Excel), simply change the file type in the "Save
as Type" drop-down menu.
A good data set will include variable and value labels that provide a fuller description of
both the variable and the meaning of each value within a variable (for nominal and ordinal
data; value labels are not needed for continuous data). Unlike variable names, which are
limited to 8 characters, a variable label may be up to 120 characters long and gives a fuller
description of the variable. Value labels attach a meaning to each code, for example 1 for males
and 2 for females for a variable on gender.
Missing Values
SPSS has two types of missing values that are automatically excluded from statistics
computed by procedures: system-missing values and user-missing values. Any variable for
which a valid value cannot be read from raw data or computed is assigned the system-
missing value. User-missing values are values that you tell SPSS to treat as missing for
particular variables. These values are values (other than blanks) that you coded into your
data to indicate non-acceptable responses.
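The effect of the two kinds of missing values on a statistic can be imitated outside SPSS with a short Python sketch. Here None stands in for a system-missing value and 99 is a hypothetical user-missing code; both choices, and the function name, are assumptions made for illustration.

```python
def valid_mean(values, user_missing=()):
    """Mean of the values that are neither system-missing nor user-missing."""
    valid = [v for v in values if v is not None and v not in user_missing]
    return sum(valid) / len(valid) if valid else None


# Hypothetical ages: None = could not be read, 99 = coded as "refused to answer".
ages = [25, 31, None, 99, 40]
mean_declared = valid_mean(ages, user_missing={99})   # 99 is excluded
mean_undeclared = valid_mean(ages)                    # 99 wrongly treated as an age
print(mean_declared, mean_undeclared)
```

The gap between the two results shows why codes such as 99 need to be declared as missing before any statistics are computed.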
Statistical Analysis through SPSS
Now the selected variable appears in a box on the right and disappears from the left box.
Note that when a variable is highlighted in the left box, the arrow button is pointed right for
you to complete the selection. When a variable is highlighted in the right box, the arrow
button is pointed left to enable you to deselect a variable (by clicking the button) if necessary.
If you need additional statistics besides the frequency count, click the Statistics... button at
the bottom of the screen. When the Statistics... dialog box appears, make appropriate
selections and click Continue. In this instance, we are interested only in frequency counts.
The output appears on the Viewer screen.
The mean, standard deviation, minimum, and maximum are displayed by default. The
variables are displayed, by default, in the order in which you selected them. Click Options...
for other statistics and display order. The following output will be displayed on the Viewer
screen.
The MEANS procedure displays means, standard deviations, and group counts for
dependent variables based on grouping variables. To run the MEANS procedure:
Select Mean, Number of cases, and Standard Deviation. Normally these options are
selected by default. If any other options are selected, deselect them by clicking them.
Click Continue.
Click OK.
The output will be displayed on the Viewer screen.
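What the MEANS procedure reports (mean, number of cases and standard deviation of a dependent variable within each category of a grouping variable) can be sketched in Python as follows. The records, variable names and function name are hypothetical, and the standard deviation uses the sample (n − 1) formula as an assumption.

```python
import statistics
from collections import defaultdict


def group_means(records, dependent, grouping):
    """Mean, count and sample SD of `dependent` within each level of `grouping`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[grouping]].append(row[dependent])
    return {
        level: {
            "mean": statistics.mean(values),
            "n": len(values),
            "sd": statistics.stdev(values) if len(values) > 1 else None,
        }
        for level, values in groups.items()
    }


# Hypothetical cases: gender coded 1 = male, 2 = female.
data = [
    {"gender": 1, "score": 10}, {"gender": 1, "score": 12}, {"gender": 1, "score": 14},
    {"gender": 2, "score": 20}, {"gender": 2, "score": 24},
]
summary = group_means(data, dependent="score", grouping="gender")
print(summary)
```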
SPSS displays the output in pivot tables with cells divided by vertical lines. Sometimes,
the default width of the output table columns is not enough to fit the values that will be
inserted in the cells. To edit a pivot table, double-click the pivot table; this activates the
Pivot Table Editor. Alternatively, click the right mouse button on the pivot table and, from the
context menu, choose SPSS Pivot Table Object/Open; the pivot table will be ready to edit in its
own separate Pivot Table Editor window.
Printing the Output
Once you are satisfied with your analysis you may want to obtain a hard copy of the output.
You may print the entire output on the viewer window, or delete the sections you do not
want before you print. Or you can save the output to a diskette or hard drive and print it later.
The SPSS data file contains the actual data, variable and value labels, and missing values that
appear in the SPSS Data Editor window.
Correlation analysis
Linear Regression
T-test
T-test is a data analysis procedure to test the hypothesis that two population means are equal.
SPSS can compute independent (not related) and dependent (related) t-tests. For
independent t-tests, you must have a grouping variable with exactly two values (e.g., male
and female, pass and fail). The variable may either be numeric or character. Suppose you
have a grouping variable with more than two categories. You may use the RECODE
(Transform/Recode) command to collapse the categories into two groups. RECODE is a
powerful SPSS command for data transformation with both numeric and string variables.
Select Variables
Select Grouping Variable.
Click on Define Groups...
Type 1 for Group 1, and 2 for Group 2.
A t-test with two related variables is performed using the Paired-Samples T-Test from the
Analyze/Compare Means menu.
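For readers who want to see the arithmetic behind the independent-samples test, here is a minimal sketch of the pooled-variance t statistic. It assumes equal population variances in the two groups; the function name and the two small groups are hypothetical. The p value would still be read from a t table with the returned degrees of freedom.

```python
import math
import statistics


def independent_t(group1, group2):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    t_stat = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, df


# Hypothetical scores for Group 1 and Group 2.
t_stat, df = independent_t([10, 12, 14], [16, 18, 20])
print(f"t = {t_stat:.3f}, df = {df}")
```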
The statistical technique used to test the null hypothesis that several population means are
equal is called analysis of variance. It is called that because it examines the variability in the
sample, and based on the variability, it determines whether there is a reason to believe the
population means are not equal. The statistical test for the null hypothesis that all of the
groups have the same mean in the population is based on computing the ratio of the between-group
to the within-group variability estimates, called the F statistic. A significant F value only tells
you that the population means are probably not all equal. It does not tell you which pairs of
groups appear to have different means. To pinpoint exactly where the differences are,
multiple comparisons may be performed.
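The F statistic for a one-way analysis of variance can be sketched as the between-group mean square divided by the within-group mean square. The function name and the three small groups below are hypothetical and serve only to show the calculation.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)          # total number of observations
    k = len(groups)              # number of groups
    grand_mean = sum(all_values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical samples from three groups.
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_value:.2f}")
```

A large F, as in this made-up example, says only that the group means are unlikely to be all equal; it does not say which groups differ.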
11. DATA ENVELOPMENT ANALYSIS TECHNIQUES
Introduction
Performance of any decision-making unit (DMU) largely depends on how
efficiently inputs are used in the production, marketing and distribution processes. As
the resources at its disposal are limited and have competing uses, they must be optimally applied
to enhance productivity, efficiency and profitability. In order to survive in today’s
competitive environment, a DMU has to improve its performance not only relative to its past
performance but also relative to its competitors in the industry. In this context, it becomes
vital to carry out inter-firm comparisons to identify the best practices of efficient firms in
resource utilization and apply them to improve the efficiency of relatively less efficient firms.
To identify the extent to which a firm produces output efficiently and cost-effectively,
its economic efficiency is estimated. Economic efficiency is the product of two
efficiencies: technical efficiency and allocative efficiency. Technical efficiency refers to ‘the
firm’s ability to produce the maximum possible output from a given combination of inputs
and technology, regardless of market demand and prices’. Allocative efficiency refers to
the firm’s ability to use the inputs in optimal proportion, given their respective prices.
Classical production theory assumes that, given the level of technology, a production
function shows the maximum quantity of output that a firm can produce with the given set of
inputs. This means that the firm produces output with 100 per cent technical efficiency.
However, in reality, a firm’s realised output may be below the potential output. Hence,
measurement of an individual firm’s technical efficiency becomes essential to know the extent
of deviation of the firm’s actual output from its potential output. The two most popular
approaches to estimating technical efficiency are Data Envelopment Analysis (DEA) and
Stochastic Production Frontier (SPF) Analysis. In this lecture, DEA will be discussed in detail
and the efficiency estimation procedure will be taught through DEA
software.
Genesis of DEA
Farrell (1957) laid the foundation for new approaches to efficiency and productivity
analysis at the micro level, involving new insights on two issues: how to define efficiency
and productivity, and how to calculate the benchmark technology and the efficiency
measures. He showed how to define economic efficiency and how to decompose it into its
technical and allocative components. He defined technical efficiency as the ratio of observed
output to the maximum potential output that can be attained from given inputs. If a firm’s
actual output is below the potential output, the shortfall is regarded as an indicator of
inefficiency. Allocative efficiency (AE) of a firm is defined as the ratio of minimum cost to
the actual cost. It refers to the firm’s ability to use the inputs in optimal proportion, given the
prices of inputs.
Farrell’s paper gave birth to two approaches to efficiency measurement: the
deterministic frontier approach and the stochastic frontier approach (SFA). Deterministic
frontiers can be parametric or non-parametric. Aigner and Chu (1968), Afriat (1972),
Richmond (1974), and Schmidt (1976) developed parametric deterministic models, while
Charnes, Cooper and Rhodes (1978) developed a non-parametric deterministic approach,
popularly known as Data Envelopment Analysis (DEA), which was extended by Banker,
Charnes, and Cooper (1984). SFA was developed independently by Aigner, Lovell and
Schmidt (1977) and Meeusen and van den Broeck (1977) and later extended by Jondrow, Lovell,
Materov, and Schmidt (1982) and Battese and Coelli (1992; 1995). Both DEA and SFA are
being applied by researchers to measure the technical efficiency of decision-making units
(DMUs) using cross-sectional as well as panel data. Earlier, economists
usually preferred to use econometric methods to measure efficiency. In the 1990s, many of
them also started using DEA because of its ability to handle multiple inputs and outputs
and its suitability for studying the performance of DMUs in both the manufacturing and
service sectors.
relationship, and ui ≥ 0 is a one-sided error term representing technical inefficiency in the
sense that it measures the shortfall of output (yi) from its maximal possible value given by
the stochastic frontier [f(xi; β) + vi]. Model (2) is known as the SFPF because the output values
are bounded above by the stochastic (random) variable exp(xiβ + vi). The random error vi
can be positive or negative (Coelli, et al., 1998).
Direct estimates of the stochastic frontier model can be obtained either by
maximum likelihood or corrected ordinary least square (COLS) methods. Introducing
specific probability distributions for vi and ui, assuming that ui and vi are independent and
that xi is exogenous, the asymptotic properties of the maximum likelihood estimators can
be obtained. The model can also be estimated by COLS by adjusting the constant term by
E(ui), which is derived from the moments of the OLS residuals. Once a model of this form
is estimated, one can readily obtain the residuals ε̂i = yi − f(xi, β̂), which can be regarded as
estimates of the error terms εi.
Meeusen and van den Broeck (1977) assign an exponential distribution to u, Battese and Corra
(1977) assign a half-normal distribution to u, and Aigner et al. (1977) consider both
distributions for u. The parameters to be estimated are β, σv² and the variance parameter σu²
associated with u. Either distributional assumption on u implies that the composed error
(v − u) is negatively skewed, and statistical efficiency requires that the model be estimated by
maximum likelihood. After estimating the production frontier, an estimate of mean technical
inefficiency in the sample is provided by E(−u) = E(v − u) = −(2/π)^(1/2) σu in the
normal-half normal case and by E(−u) = E(v − u) = −σu in the normal-exponential case.
The SFA approach gives a less biased measure of efficiency. However, it could only
provide average technical efficiency measures for the sample observations. Although these
aggregate measures are useful in a way, individual observation-specific technical efficiency
measures are more useful from a policy viewpoint. Jondrow, Lovell, Materov
and Schmidt (1982) and Kalirajan and Flinn (1983) independently considered the Aigner
et al. (1977) and Meeusen and van den Broeck (1977) stochastic models to predict the
random variable ui under the assumption that εi is known. SFA does not have an a priori
justification for the selection of any particular distributional form of the random error term,
and the resulting efficiency measures may be sensitive to the distributional assumption.
Another problem with SFA is that it cannot handle multiple output variables at a time
(Thanassoulis, 2001).
is often inadequate due to the existence of multiple inputs and outputs related to different
resources, activities and environmental factors. DEA methodology was developed to solve
this problem. This technique is quite useful for measuring the efficiency of service sector
DMUs, especially government organizations providing public goods.
We have two basic DEA models: the CCR model, developed by Charnes, Cooper and
Rhodes in 1978, and the BCC model, developed by Banker, Charnes, and Cooper in 1984. The CCR
model generalises the single output/input ratio measure of efficiency for a single DMU in
terms of a fractional linear programming (FLP) formulation, transforming the multiple
output/input characteristics of each DMU to that of a single “virtual” output and “virtual”
input. The model defines the relative efficiency of any DMU as a weighted sum of outputs
divided by a weighted sum of inputs, where all efficiency scores are restricted to lie between
zero and one. An efficiency score less than one means that a linear combination of other
units from the sample could produce the same vector of outputs using a smaller vector of
inputs. The score reflects the radial distance from the estimated production frontier to the
DMU under consideration. The variables in the model are the input-output weights, and the LP
solution produces the weights most favourable to the unit under reference. In order to
calculate efficiency scores, the FLP is converted into an LP by normalising either the numerator or
the denominator of the fractional programming objective function. In the output-maximization
DEA program, the weighted sum of inputs is constrained to be unity and the
weighted sum of outputs is maximized, while in the input-minimization DEA program, the
weighted sum of outputs is constrained to be unity and the weighted sum of inputs is minimized.
The CCR model is based on the constant returns to scale assumption. Under this assumption, if the
input levels of a feasible input-output correspondence are scaled up or down, then another
feasible input-output correspondence is obtained in which the output
levels are scaled by the same factor as the input levels (Thanassoulis, 2001).
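In the special case of one input and one output, the CCR score needs no linear programme: under constant returns to scale it reduces to each DMU's output/input ratio divided by the best ratio in the sample. The sketch below covers only this special case, with hypothetical data and a hypothetical function name; the general multiple-input, multiple-output model requires solving one LP per DMU with DEA software or an LP solver.

```python
def ccr_efficiency_single(inputs, outputs):
    """CCR scores for DMUs with a single input and a single output."""
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)              # best-practice output per unit of input
    return [r / best for r in ratios]


# Hypothetical DMUs: input quantities and the outputs they produce.
inputs = [2, 4, 5]
outputs = [2, 3, 5]
scores = ccr_efficiency_single(inputs, outputs)
print(scores)
```

Here the first and third DMUs define the frontier (score one), while the second obtains only three quarters of the best-practice output per unit of input.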
Another version of DEA was given by Banker, Charnes and Cooper (1984). The
primary difference between the BCC and CCR models is the convexity constraint, which
represents the returns to scale. The CCR model is based on the assumption that constant
returns to scale exist at the efficient frontier, whereas BCC assumes a variable returns to scale
frontier. CCR efficiency is overall technical efficiency (OTE), known as global technical
efficiency, whereas BCC efficiency is pure technical efficiency (PTE) net of the scale effect,
known as local technical efficiency. If a DMU has both a CCR-efficiency score and a
BCC-efficiency score of one, it is operating at the most productive scale size (MPSS). If a DMU has
a BCC-efficiency score of one and a CCR-efficiency score of less than one, it is operating locally
efficiently but not globally efficiently, owing to the scale size of the DMU. Thus, inefficiency
in any DMU may be caused by the inefficient operation of the DMU itself (BCC-inefficiency)
or by the disadvantageous conditions under which the DMU is operating (scale-inefficiency).
Scale efficiency is estimated by dividing the CCR-efficiency by the BCC-efficiency
for a DMU. Another technique based on DEA is the Malmquist Productivity Index
(MPI) proposed by Caves, et al. in 1982. The MPI is defined with distance functions. For
panel data, distance functions make it possible to describe multiple input-output production
technologies without behavioural objectives such as profit maximisation or cost
minimisation. A detailed description of the MPI model is presented in chapter 7.
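Once CCR and BCC scores are available for a DMU, scale efficiency is their ratio (CCR score divided by BCC score), and the cases described above can be read off directly. The sketch below assumes the two scores have already been produced by DEA software; the function names and the example scores are hypothetical.

```python
def scale_efficiency(ccr_score, bcc_score):
    """Scale efficiency = overall technical efficiency / pure technical efficiency."""
    return ccr_score / bcc_score


def describe_dmu(ccr_score, bcc_score, tol=1e-9):
    """Classify a DMU from its CCR and BCC efficiency scores."""
    bcc_efficient = abs(bcc_score - 1) <= tol
    ccr_efficient = abs(ccr_score - 1) <= tol
    if bcc_efficient and ccr_efficient:
        return "MPSS"
    if bcc_efficient:
        return "locally efficient, scale-inefficient"
    return "BCC-inefficient"


# Hypothetical scores for three DMUs: (CCR, BCC).
for ccr, bcc in [(1.0, 1.0), (0.7, 1.0), (0.6, 0.8)]:
    print(ccr, bcc, round(scale_efficiency(ccr, bcc), 3), describe_dmu(ccr, bcc))
```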
Terminology of DEA
given quantity and combination of inputs (productivity) for a group of similar
organisations.
3. Decision Making Unit (DMU): The term DMU was first used by Charnes, Cooper and
Rhodes in 1978 in their seminal paper on DEA. A DMU is an individual production unit
producing tangible or intangible output under private, cooperative, government or any
other form of ownership. The term covers manufacturing firms, banking and
insurance companies, transport and communication firms, hospitals, schools and
universities, other service providing firms, government organizations, local
governments, municipal corporations, etc. For measuring the relative performance of
individual DMUs, the set of DMUs should share the same fundamental characteristics in
terms of environment and technological constraints. If someone wants to assess the
efficiency of educational institutions, the DMUs in the dataset should be homogeneous.
For instance, schools cannot be compared with universities.
4. Economies of Scale: It refers to increasing a firm’s size until it obtains the minimum
cost per unit of output.
5. Inefficiency: The amount by which a firm lies below the estimated frontier can be
regarded as a measure of inefficiency. Under the given technology, if the actual output of a
firm equals the potential output, the firm has no inefficiency in production.
6. Most Productive Scale Size (MPSS): It is that size at which a DMU obtains 100 percent
pure technical efficiency and scale efficiency. This is possible when a DMU attains an
efficiency score of one under constant returns to scale technology assumption.
7. Pareto Efficiency: A DMU is Pareto-efficient if it is not possible to reduce any one of
its input levels without increasing at least another one of its input levels and /or without
lowering at least one of its output levels.
8. Peer: A peer is an efficient DMU which acts as a reference point (in terms of input and
output mix) for inefficient DMUs.
9. Productivity: It can be defined as the ratio of a measure of output to a measure of one or
more of the inputs used to produce the output. There are two main concepts of productivity: partial
(single) factor productivity and total (multiple) factor productivity. Partial factor
productivity is a simple ratio of the volume of total output to the total quantity of
a single input. For instance, labour productivity is measured by dividing the total
production of a firm by the total number of workers (or total hours of work) of that firm.
The partial factor productivity concept cannot capture the true performance of a resource.
For instance, labour productivity in a firm can be raised either by improving the quality
of human resources through training and retraining or simply by retrenching
manpower and using a more capital- and technology-intensive production process.
Therefore, a total factor productivity (TFP) index is measured to assess the overall
productivity of a firm or industry. TFP is the ratio of a weighted sum of outputs to a
weighted sum of inputs. A TFP index value greater than one indicates
positive growth in productivity and a value less than one means negative
growth. If the value of the index is equal to one, there is no growth in productivity.
Various methods have been developed to compute TFP. In this study, we apply a non-parametric
DEA-based method, known as MPI, to measure the TFP growth in the sugar
mills.
10. Production Frontier: The production frontier gives the maximal output that can be
achieved with a given amount of inputs.
11. Returns to Scale: It refers to a measure of change in output resulting from a change in
the scale of a firm’s operation as determined by its input usage. There are three returns
to scale—increasing, constant and decreasing. When inputs are doubled and output
increases more than double, it is increasing returns to scale. If the output increases in the
same proportion as inputs are increased, it is constant returns to scale. Decreasing returns
to scale exists when output increases less than the proportional increase in the inputs.
12. Pure Technical Efficiency: It refers to the proportion of technical efficiency which is
attributed to the efficient conversion of inputs into output. Effect of size of plant on the
efficiency is neutralized in it. It is also known as managerial efficiency or local
efficiency. It is estimated through BCC DEA model which is based on the variable
returns to scale technology assumption. Value of pure technical efficiency score lies
between zero and one.
13. Technical Efficiency: Technical efficiency refers to the firm’s ability to produce the
maximum possible output from a given combination of inputs and technology. In DEA,
technical efficiency is determined by the difference between the observed ratio of
a DMU’s output(s) to input(s) and the ratio achieved by best practice DMUs. It is,
therefore, a relative technical efficiency, not the absolute technical efficiency. Its value
lies between zero and one. If a DMU is on the production frontier and does not have any
input or output slack, its technical efficiency score will be equal to one. Technical
efficiency can be decomposed into scale efficiency and pure technical efficiency.
14. Scale Efficiency: The extent to which an organization can take advantage of returns to
scale by altering its size towards the optimal scale. In DEA, scale efficiency for a
DMU is calculated by dividing its CCR efficiency score by its BCC efficiency score. As
the BCC score is greater than or equal to the CCR score, the scale efficiency score lies
between zero and one.
15. Slacks: Slacks in DEA refer to the extra quantity by which an input (output) can be
reduced (increased) to obtain technical efficiency after all inputs (outputs) have been
radially reduced to reach the production frontier.
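The difference between partial and total factor productivity (item 9) can be illustrated with a small numerical sketch. The firm, its quantities and the weights below are hypothetical and chosen only for illustration; the helper names are not taken from any library.

```python
def weighted_sum(quantities, weights):
    """Weighted sum of a list of quantities."""
    return sum(q * w for q, w in zip(quantities, weights))


def tfp_level(outputs, output_weights, inputs, input_weights):
    """TFP level: weighted sum of outputs divided by weighted sum of inputs."""
    return weighted_sum(outputs, output_weights) / weighted_sum(inputs, input_weights)


# Hypothetical firm with one output and two inputs (labour, capital).
output_weights = [1.0]
input_weights = [0.6, 0.4]  # assumed weights

# Period t: output 100, labour 50, capital 20.
# Period t+1: output 120, labour 40 (retrenchment), capital 35.
tfp_t = tfp_level([100.0], output_weights, [50.0, 20.0], input_weights)
tfp_t1 = tfp_level([120.0], output_weights, [40.0, 35.0], input_weights)

# Partial (labour) productivity growth versus TFP growth.
labour_growth = (120.0 / 40.0) / (100.0 / 50.0)
tfp_growth = tfp_t1 / tfp_t

print(f"labour productivity index: {labour_growth:.2f}")
print(f"TFP index:                 {tfp_growth:.2f}")
```

In this made-up case labour productivity rises faster than TFP because labour was replaced by capital, which is the distortion of partial measures described in item 9.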
DEA MODELS
Basic DEA models are described as:
CCR Model
This model generalizes the usual input/output ratio measure of efficiency for a given
firm in terms of a fractional linear program formulation. Mathematically, the relative
efficiency of the kth DMU is given by:
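$$\max\; h_k = \frac{\sum_{r=1}^{s} u_{rk}\, y_{rk}}{\sum_{i=1}^{m} v_{ik}\, x_{ik}} \qquad (1)$$

Subject to:

$$\frac{\sum_{r=1}^{s} u_{rk}\, y_{rj}}{\sum_{i=1}^{m} v_{ik}\, x_{ij}} \le 1, \qquad j = 1, \dots, k, \dots, n$$

$$\frac{u_{rk}}{\sum_{i=1}^{m} v_{ik}\, x_{ik}} \ge \varepsilon, \qquad r = 1, \dots, s$$

$$\frac{v_{ik}}{\sum_{i=1}^{m} v_{ik}\, x_{ik}} \ge \varepsilon, \qquad i = 1, \dots, m$$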
Where:
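$y_{rk}$ = the amount of the $r$th output produced by the $k$th DMU; $x_{ik}$ = the amount of the
$i$th input used by the $k$th DMU; $u_{rk}$ and $v_{ik}$ = the weights given to the $r$th output and
the $i$th input, respectively; $\varepsilon$ = a small positive number; $s$, $m$ and $n$ = the number of
outputs, inputs and DMUs.
Normalizing the weighted sum of inputs of the $k$th DMU to one converts the fractional
program into a linear program:

$$\max\; h_k = \sum_{r=1}^{s} u_{rk}\, y_{rk} \qquad (2)$$

Subject to:

$$\sum_{r=1}^{s} u_{rk}\, y_{rj} - \sum_{i=1}^{m} v_{ik}\, x_{ij} \le 0, \qquad j = 1, \dots, n$$

$$\sum_{i=1}^{m} v_{ik}\, x_{ik} = 1$$

$$u_{rk} \ge \varepsilon, \qquad r = 1, \dots, s$$

$$v_{ik} \ge \varepsilon, \qquad i = 1, \dots, m$$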
Since the number of DMUs is generally larger than the total number of inputs and
outputs, solving the dual of the model can reduce the computational burden.
Mathematically, the dual formulation of the above model is:
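$$\min\; z_k = \theta_k - \varepsilon \sum_{r=1}^{s} S^{+}_{rk} - \varepsilon \sum_{i=1}^{m} S^{-}_{ik} \qquad (3)$$

Subject to:

$$\sum_{j=1}^{n} \lambda_{jk}\, y_{rj} - S^{+}_{rk} = y_{rk}, \qquad r = 1, \dots, s$$

$$\sum_{j=1}^{n} \lambda_{jk}\, x_{ij} + S^{-}_{ik} = \theta_k\, x_{ik}, \qquad i = 1, \dots, m$$

$$\lambda_{jk} \ge 0, \qquad j = 1, \dots, n$$

$$\theta_k \ \text{free}$$

$$S^{+}_{rk},\; S^{-}_{ik} \ge 0; \qquad r = 1, \dots, s, \quad i = 1, \dots, m$$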
Where:
$S^{-}_{ik}$ = slack in the $i$th input of the $k$th DMU; $S^{+}_{rk}$ = slack in the $r$th output of the
$k$th DMU; $\theta_k$ = the proportional (radial) reduction
applied to all inputs of DMU $k$ to impose efficiency. If for DMU $k$, $\theta^{*}_{k} = 1$ and all slacks
are zero, it is Pareto efficient. Non-zero slacks and (or) $\theta^{*}_{k} < 1$ identify the sources and
amount of any inefficiency that may exist in the DMU under reference.
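The envelopment (dual) form can be sketched numerically with a general-purpose LP solver. The sketch below is a simplified illustration and not the full model: it solves only the radial part (minimising θ) and leaves out the ε-weighted slack terms. The function name `ccr_input_efficiency` and the three-DMU data set are hypothetical; `scipy.optimize.linprog` is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog


def ccr_input_efficiency(X, Y, k):
    """Radial input-oriented CCR score (theta) of DMU k.

    X has shape (m, n) and Y has shape (s, n); columns are DMUs.
    Slack terms are ignored in this simplified sketch.
    """
    m, n = X.shape
    s = Y.shape[0]
    # Decision vector: [theta, lambda_1, ..., lambda_n]; minimise theta.
    c = np.zeros(n + 1)
    c[0] = 1.0
    # Outputs: sum_j lambda_j * y_rj >= y_rk  ->  -Y @ lambda <= -y_k
    A_out = np.hstack([np.zeros((s, 1)), -Y])
    b_out = -Y[:, k]
    # Inputs: sum_j lambda_j * x_ij <= theta * x_ik
    A_in = np.hstack([-X[:, [k]], X])
    b_in = np.zeros(m)
    bounds = [(None, None)] + [(0, None)] * n  # theta free, lambda >= 0
    res = linprog(c, A_ub=np.vstack([A_out, A_in]),
                  b_ub=np.concatenate([b_out, b_in]),
                  bounds=bounds, method="highs")
    return res.x[0]


# Hypothetical data: one input, one output, three DMUs.
X = np.array([[2.0, 4.0, 8.0]])
Y = np.array([[2.0, 6.0, 8.0]])
scores = [ccr_input_efficiency(X, Y, k) for k in range(X.shape[1])]
print(np.round(scores, 4))
```

With a single input and a single output the score reduces to each DMU's output/input ratio divided by the best ratio in the sample, which gives a quick check on the solver output.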
BCC Model
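The primary difference between the BCC model and the CCR model is the convexity
constraint. In the BCC model the $\lambda_{jk}$s are restricted to sum to one (i.e. $\sum_{j=1}^{n} \lambda_{jk} = 1$). If we
impose $\sum_{j=1}^{n} \lambda_{jk} \le 1$ instead of $\sum_{j=1}^{n} \lambda_{jk} = 1$, we obtain the Non-Increasing
Returns to Scale (NIRS) model. Similarly, if we impose $\sum_{j=1}^{n} \lambda_{jk} \ge 1$ instead of $\sum_{j=1}^{n} \lambda_{jk} = 1$,
we obtain the Non-Decreasing Returns to Scale (NDRS) model.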
[Figure omitted: CRS and VRS frontiers drawn on X and Y axes]
Figure 1: Comparison of CRS and VRS Frontiers
Figure-1 makes the comparison of CCR and BCC models. The CCR model is based
on constant returns to scale (CRS) technology assumption and the BCC model is based on
variable returns to scale (VRS) technology assumption. The CRS surface is the straight-line
oicm and the VRS surface is abcde.
Efficiency of any interior point (such as ‘k’) is intuitively given by the distance
between the envelope and itself. Typically, such a distance may be measured either
horizontally along the x-axis or vertically along the y-axis, providing an input-oriented or
output-oriented measure, respectively. For example, using an input-oriented measure,
the technical efficiency of DMU ‘k’ will be measured by hi/hk under the CRS technology
assumption and by hj/hk under the VRS technology assumption. A measure of scale efficiency
is provided by the ratio hi/hj. A DMU at point ‘c’ is operating at the most productive scale size
(MPSS).
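The three ratios read off the figure (CRS efficiency, VRS efficiency and scale efficiency) can also be obtained numerically. The following is a minimal sketch, assuming `scipy.optimize.linprog` is available. It solves the radial input-oriented model without slack terms and switches to VRS by adding the convexity constraint. The function `input_efficiency` and the four-DMU data set are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def input_efficiency(X, Y, k, vrs=False):
    """Radial input-oriented efficiency of DMU k (columns of X, Y are DMUs).

    vrs=False gives the CCR (CRS) score; vrs=True adds sum(lambda) = 1
    and gives the BCC (VRS) score. Slacks are ignored in this sketch.
    """
    m, n = X.shape
    s = Y.shape[0]
    c = np.zeros(n + 1)
    c[0] = 1.0  # minimise theta
    A_ub = np.vstack([
        np.hstack([np.zeros((s, 1)), -Y]),  # outputs at least y_k
        np.hstack([-X[:, [k]], X]),         # inputs at most theta * x_k
    ])
    b_ub = np.concatenate([-Y[:, k], np.zeros(m)])
    A_eq = b_eq = None
    if vrs:
        A_eq = np.hstack([[0.0], np.ones(n)]).reshape(1, -1)
        b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[0]


# Hypothetical data: one input, one output, four DMUs (the last is interior).
X = np.array([[2.0, 4.0, 8.0, 6.0]])
Y = np.array([[2.0, 6.0, 8.0, 5.0]])

k = 3
te_crs = input_efficiency(X, Y, k)             # corresponds to hi/hk
te_vrs = input_efficiency(X, Y, k, vrs=True)   # corresponds to hj/hk
scale_eff = te_crs / te_vrs                    # corresponds to hi/hj
print(round(te_crs, 4), round(te_vrs, 4), round(scale_eff, 4))
```

Because the VRS frontier envelops the data more tightly than the CRS frontier, the VRS score is never below the CRS score, so the scale efficiency ratio cannot exceed one.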
Precautions to be taken
Post-DEA Analysis
A key aspect of DEA is incorporating environmental factors into the model as either inputs
or outputs. Resources available to units are classed as inputs whilst activity levels or
performance measures are represented by outputs. One approach to incorporating
environmental factors is to consider whether they are effectively additional resources to
the unit, in which case they can be incorporated as inputs, or whether they are resource users,
in which case they may be better included as outputs. For example, in comparing the efficiency
of schools, research has indicated that parents of higher educational attainment generally
provide greater support to their children and are therefore effectively an additional resource
to the schools, which should be classed as an input. Tobit regression is an appropriate method to
study the impact of environmental and background factors on efficiency. It assumes that the
data are truncated, or censored, above or below certain values. In DEA, values of the dependent
variable are censored, as they range between 0 and 1.
Introduction
improvement in technical efficiency (catch up). The approach has become quite popular
because: (i) it does not require price data, therefore suitable when price data are not available
or price data are distorted; (ii) it rests on much weaker behavioral assumptions, since it does
not assume cost minimizing or revenue maximizing behaviour; (iii) it uses panel data and
provides a decomposition of productivity change into two components— technical change
and technical efficiency change. Technical change reflects improvement or deterioration in
the performance of best practice firms, while technical efficiency change reflects the
convergence toward or the divergence from best practice on the part of the remaining firms.
The significance of the decomposition is that it provides information on the source of overall
productivity change in the firms.
$$M_0^{t+1}(y^{t+1}, x^{t+1}, y^{t}, x^{t}) = \left[ \frac{D_0^{t}(y^{t+1}, x^{t+1})}{D_0^{t}(y^{t}, x^{t})} \times \frac{D_0^{t+1}(y^{t+1}, x^{t+1})}{D_0^{t+1}(y^{t}, x^{t})} \right]^{1/2} \qquad (1)$$
Equation (1) is the geometric mean of two Malmquist productivity indices.
The first is evaluated with respect to period t technology and the second
with respect to period t+1 technology. Assuming that $D_0^{t}(y^{t}, x^{t}) \le 1$ and
$D_0^{t+1}(y^{t+1}, x^{t+1}) \le 1$, equation (1) can be rewritten as
$$M_0^{t+1}(y^{t+1}, x^{t+1}, y^{t}, x^{t}) = \frac{D_0^{t+1}(y^{t+1}, x^{t+1})}{D_0^{t}(y^{t}, x^{t})} \left[ \frac{D_0^{t}(y^{t+1}, x^{t+1})}{D_0^{t+1}(y^{t+1}, x^{t+1})} \times \frac{D_0^{t}(y^{t}, x^{t})}{D_0^{t+1}(y^{t}, x^{t})} \right]^{1/2} \qquad (2)$$
where the ratio outside the square brackets in equation (2) represents technical
efficiency change (effch) and the expression in the square brackets indicates technical
change (techch). Thus, the MPI can be decomposed into change in technical efficiency
(catching up) and change in the frontier (technical progress):
$$\text{effch} = \frac{D_0^{t+1}(y^{t+1}, x^{t+1})}{D_0^{t}(y^{t}, x^{t})} \qquad (3)$$

$$\text{techch} = \left[ \frac{D_0^{t}(y^{t+1}, x^{t+1})}{D_0^{t+1}(y^{t+1}, x^{t+1})} \times \frac{D_0^{t}(y^{t}, x^{t})}{D_0^{t+1}(y^{t}, x^{t})} \right]^{1/2} \qquad (4)$$

Technical efficiency change (effch) measures the change in technical efficiency
between periods t and t+1 with respect to the production possibilities existing in each period.
Technical change (techch) is the geometric mean of the shifts in frontier at the factor ratios
of periods t+1 and t respectively. The value of the MPI greater than 1 means productivity
growth and a value less than 1 means deterioration in productivity. The same is applicable
to each of the components of the Malmquist Productivity Index.
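The decomposition can be checked with a few lines of arithmetic. The sketch below takes four distance-function values as given and returns the MPI together with its two components. The numbers are hypothetical, and the argument names (`d_t_t` for the period-t observation against the period-t frontier, and so on) are illustrative only.

```python
from math import sqrt


def malmquist(d_t_t, d_t_t1, d_t1_t, d_t1_t1):
    """Return (mpi, effch, techch).

    d_a_b is the distance of the period-b observation measured against
    the period-a frontier, e.g. d_t_t1 = D^t(y^{t+1}, x^{t+1}).
    """
    effch = d_t1_t1 / d_t_t
    techch = sqrt((d_t_t1 / d_t1_t1) * (d_t_t / d_t1_t))
    return effch * techch, effch, techch


# Hypothetical distance-function values for one production unit.
mpi, effch, techch = malmquist(d_t_t=0.8, d_t_t1=1.1, d_t1_t=0.7, d_t1_t1=0.9)

# Equation (1) written directly as a geometric mean, for comparison.
mpi_direct = sqrt((1.1 / 0.8) * (0.9 / 0.7))

print(f"effch={effch:.4f} techch={techch:.4f} MPI={mpi:.4f}")
```

All three values exceed one in this made-up case, so the unit both caught up with the frontier and benefited from an outward shift of the frontier.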
Figure-1 describes the MPI with one input (x) and one output (y) under CRS
technology and its decomposition into efficiency change and technical change. The MPI under
CRS technology indicates a rise in potential productivity as the technology frontier shifts
from t to t+1. Points P and R in the figure represent the input-output combinations of a
production unit (mill) in periods t and t+1, respectively. In both periods, the unit is operating
below the production possibility frontier.
[Figure omitted: frontiers in period t and period t+1, with input x (Xt, Xt+1) on the horizontal axis and output y (Yt, Y1, Y2, yt+1, Y3) on the vertical axis]
Technical efficiency change and technical change are represented by the distance
functions. In terms of the distances along the y-axis, the index becomes
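$$M^{t+1}(y_{t+1}, x_{t+1}, y_{t}, x_{t}) = \frac{y_{t+1}/Y_3}{y_{t}/Y_1} \left[ \frac{y_{t+1}/Y_2}{y_{t+1}/Y_3} \times \frac{y_{t}/Y_1}{y_{t}/Y_2} \right]^{1/2} \qquad (7.5)$$

$$\text{Technical Efficiency Change} = \frac{y_{t+1}/Y_3}{y_{t}/Y_1} \qquad (7.6)$$

$$\text{Technical Change} = \left[ \frac{y_{t+1}/Y_2}{y_{t+1}/Y_3} \times \frac{y_{t}/Y_1}{y_{t}/Y_2} \right]^{1/2} \qquad (7.7)$$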
In order to calculate the productivity change between periods t and t+1, we need to solve
four different LP problems: $D^{t}(x^{t}, y^{t})$, $D^{t+1}(x^{t}, y^{t})$, $D^{t}(x^{t+1}, y^{t+1})$ and $D^{t+1}(x^{t+1}, y^{t+1})$.
The mathematical formulations are shown in Box 1. If technical efficiency change is to be
decomposed into scale efficiency change and pure technical efficiency change, two more LP
problems are to be solved by adding the convexity restriction to (7.8) and (7.9), that is, one
would estimate these two distance functions relative to VRS technology
(Coelli et al., 1998).
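Box-1
Linear Programming Formulation of MPI

$$[D_0^{t}(x_{t}, y_{t})]^{-1} = \max_{\phi, \lambda} \phi, \quad \text{subject to} \quad -\phi\, y_{it} + Y_{t}\lambda \ge 0, \quad x_{it} - X_{t}\lambda \ge 0, \quad \lambda \ge 0 \qquad (7.8)$$

$$[D_0^{t+1}(x_{t+1}, y_{t+1})]^{-1} = \max_{\phi, \lambda} \phi, \quad \text{subject to} \quad -\phi\, y_{i,t+1} + Y_{t+1}\lambda \ge 0, \quad x_{i,t+1} - X_{t+1}\lambda \ge 0, \quad \lambda \ge 0 \qquad (7.9)$$

$$[D_0^{t}(x_{t+1}, y_{t+1})]^{-1} = \max_{\phi, \lambda} \phi, \quad \text{subject to} \quad -\phi\, y_{i,t+1} + Y_{t}\lambda \ge 0, \quad x_{i,t+1} - X_{t}\lambda \ge 0, \quad \lambda \ge 0 \qquad (7.10)$$

$$[D_0^{t+1}(x_{t}, y_{t})]^{-1} = \max_{\phi, \lambda} \phi, \quad \text{subject to} \quad -\phi\, y_{it} + Y_{t+1}\lambda \ge 0, \quad x_{it} - X_{t+1}\lambda \ge 0, \quad \lambda \ge 0 \qquad (7.11)$$

Here $X_{t}$ and $Y_{t}$ are the input and output matrices of all DMUs in period t, $x_{it}$ and $y_{it}$ are
the input and output vectors of the $i$th DMU, $\lambda$ is a vector of weights and $\phi$ is a scalar.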
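The four distance functions named in the text can be sketched as four calls to one LP routine that differ only in which period supplies the observation and which period supplies the reference frontier. This is a minimal illustration under CRS, written in the output-oriented form commonly attributed to Coelli et al. (1998). The function `output_distance` and the two-period data set are hypothetical, and `scipy.optimize.linprog` is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog


def output_distance(x_obs, y_obs, X_ref, Y_ref):
    """CRS output distance of (x_obs, y_obs) against the frontier of (X_ref, Y_ref).

    Solves max phi s.t. phi*y_obs <= Y_ref @ lam, X_ref @ lam <= x_obs, lam >= 0,
    and returns 1/phi. Columns of X_ref and Y_ref are DMUs.
    """
    m, n = X_ref.shape
    s = Y_ref.shape[0]
    c = np.zeros(n + 1)
    c[0] = -1.0  # linprog minimises, so minimise -phi
    A_ub = np.vstack([
        np.hstack([y_obs.reshape(-1, 1), -Y_ref]),  # phi*y - Y lam <= 0
        np.hstack([np.zeros((m, 1)), X_ref]),       # X lam <= x
    ])
    b_ub = np.concatenate([np.zeros(s), x_obs])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (n + 1), method="highs")
    return 1.0 / res.x[0]


# Hypothetical panel: one input, one output, three mills, two periods.
X_t, Y_t = np.array([[2.0, 4.0, 5.0]]), np.array([[2.0, 6.0, 5.0]])
X_t1, Y_t1 = np.array([[2.0, 4.0, 5.0]]), np.array([[3.0, 8.0, 7.0]])

k = 2  # mill under evaluation
d_t_t = output_distance(X_t[:, k], Y_t[:, k], X_t, Y_t)
d_t1_t1 = output_distance(X_t1[:, k], Y_t1[:, k], X_t1, Y_t1)
d_t_t1 = output_distance(X_t1[:, k], Y_t1[:, k], X_t, Y_t)
d_t1_t = output_distance(X_t[:, k], Y_t[:, k], X_t1, Y_t1)

effch = d_t1_t1 / d_t_t
techch = np.sqrt((d_t_t1 / d_t1_t1) * (d_t_t / d_t1_t))
mpi = effch * techch
print(round(effch, 4), round(techch, 4), round(mpi, 4))
```

With one input and one output under CRS, the MPI equals the ratio of the mill's output/input ratios in the two periods, which provides an independent check on the LP results.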
12. ADVANCED MULTIVARIATE ANALYSIS
In this lecture, we shall discuss two advanced topics of multivariate analysis. They are
discriminant analysis and factor analysis.
Discriminant Analysis
Researchers often wish to classify people or objects into two or more groups. One
might need to classify persons as buyers or non-buyers, good or bad credit risks or superior,
average or poor performers in some activity. The objective is to establish a procedure to find
the predictors that best classify subjects.
Discriminant analysis joins a nominally scaled criterion or dependent variable with
one or more independent variables that are interval or ratio scaled. Once the discriminant
equation is found, it can be used to predict the classification of a new observation. The
researchers may be interested to check whether the predictor variables discriminate among
the groups. More specifically, it is essential to identify which independent variable is more
important when compared to other predictor variables. This is done by calculating a linear
function.
Discriminant function analysis, known as discriminant analysis (DA) is used to
classify cases into the values of a categorical dependent, usually a dichotomy. It is applied
when grouping variable has only two categories. Multiple discriminant analysis (MDA) is
used to classify a categorical dependent that has more than two categories. MDA is
sometimes also called discriminant factor analysis or canonical discriminant analysis.
DA shares all the usual assumptions of correlation, requiring linear and homoscedastic
relationships. Like multiple regression, it also assumes proper model specification (inclusion
of all important independents and exclusion of extraneous variables). It also assumes the
dependent variable is a true dichotomy.
Objectives of DA
The criterion variable. This is the dependent variable, also called the grouping
variable.
analogous to multiple regression, but the b's are discriminant coefficients which
maximize the distance between the means of the criterion (dependent) variable.
The eigenvalue, also called the characteristic root of each discriminant function,
reflects the ratio of importance of the dimensions which classify cases of the
dependent variable. There is one eigenvalue for each discriminant function. The
eigenvalues assess relative importance because they reflect the percents of variance
explained in the dependent variable, cumulating to 100% for all functions.
The canonical correlation, R*, is a measure of the association between the groups
formed by the dependent and the given discriminant function. When R* is zero, there
is no relation between the groups and the function. When the canonical correlation is
large, there is a high correlation between the discriminant functions and the groups.
R* is used to tell how much each function is useful in determining group differences.
An R* of 1.0 indicates that all of the variability in the discriminant scores can be
accounted for by that dimension.
The discriminant score, also called the DA score, is the value resulting from
applying a discriminant function formula to the data for a given case. The Z score is
the discriminant score for standardized data.
Unstandardized discriminant coefficients are used in the formula for making the
classifications in DA, much as b coefficients are used in regression in making
predictions. The constant plus the sum of products of the unstandardized coefficients
with the observations yields the discriminant scores. That is, discriminant
coefficients are the regression-like b coefficients in the discriminant function, in the
form L = b1 x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the
discriminant function, the b's are discriminant coefficients, the x' s are discriminating
variables, and c is a constant. There will be no constant when the data are
standardized or are deviations from the mean. The discriminant function coefficients
are partial coefficients, reflecting the unique contribution of each variable to the
classification of the criterion variable. The standardized discriminant coefficients,
like beta weights in regression, are used to assess the relative classifying importance
of the independent variables.
ANOVA table for discriminant scores is another overall test of the DA model. It is an
F test, where a "Sig." p value < .05 means the model differentiates discriminant scores
between the groups significantly better than chance (than a model with just the
constant).
(Variable) Wilks' lambda can also be used to test which independents contribute
significantly to the discriminant function. The smaller the variable Wilks' lambda for
an independent variable, the more that variable contributes to the discriminant
function. Lambda varies from 0 to 1, with 0 meaning group means differ (thus the
more the variable differentiates the groups), and 1 meaning all group means are the
same. The F test of Wilks' lambda shows which variables' contributions are
significant. Wilks' lambda is sometimes called the U statistic. In SPSS, this use of
Wilks' lambda is in the "Tests of equality of group means" table in DA output.
Method of Estimation
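The discriminant function takes the form
$$D_i = d_0 + d_1 X_1 + d_2 X_2 + \dots + d_p X_p$$
where
$D_i$ is the score on discriminant function i; the Xs are the values of the discriminating
variables used in the analysis; the $d$s are weighting coefficients; and $d_0$ is a constant.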
A single discriminant equation is required if the categorization calls for two groups. If three
groups are involved, it requires two discriminant equations. If more categories are called for
in the dependent variable, it is necessary to calculate a separate discriminant function for
each pair of classification in the criterion group. Here we shall describe two- group DA.
Let X1 and X2 be the predictor variables, G1 and G2 the two groups, and n1 and n2 the numbers of
observations in G1 and G2, respectively.
Calculation Process
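1. Find the means of X1 and X2 in each group. Let $\bar{X}_1(G1)$ and $\bar{X}_2(G1)$ be the means of X1 and
X2 in Group-1, and $\bar{X}_1(G2)$ and $\bar{X}_2(G2)$ the means of X1 and X2
in Group-2. Also find the aggregate means of X1 and X2.
2. In each group, find $\sum X_1^2$, $\sum X_2^2$ and $\sum X_1 X_2$.
3. Define the linear composite as $D_i = d_1 X_1 + d_2 X_2$ and find the values of $d_1$ and $d_2$ by
solving the following normal equations.
$$d_1 \sum (X_1 - \bar{X}_1)^2 + d_2 \sum (X_1 - \bar{X}_1)(X_2 - \bar{X}_2) = \bar{X}_1(G2) - \bar{X}_1(G1)$$
$$d_1 \sum (X_1 - \bar{X}_1)(X_2 - \bar{X}_2) + d_2 \sum (X_2 - \bar{X}_2)^2 = \bar{X}_2(G2) - \bar{X}_2(G1)$$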
The sum of squares in the above normal equation can be substituted with the following
simple formula.
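$$\sum (X_1 - \bar{X}_1)^2 = \sum (X_1 - \bar{X}_1(G1))^2 + \sum (X_1 - \bar{X}_1(G2))^2$$
$$\sum (X_2 - \bar{X}_2)^2 = \sum (X_2 - \bar{X}_2(G1))^2 + \sum (X_2 - \bar{X}_2(G2))^2$$
$$\sum (X_1 - \bar{X}_1)(X_2 - \bar{X}_2) = \sum (X_1 - \bar{X}_1(G1))(X_2 - \bar{X}_2(G1)) + \sum (X_1 - \bar{X}_1(G2))(X_2 - \bar{X}_2(G2))$$
where
$$\sum (X_1 - \bar{X}_1(G1))^2 = \sum X_1^2 - n_1 \bar{X}_1(G1)^2$$
$$\sum (X_1 - \bar{X}_1(G2))^2 = \sum X_1^2 - n_2 \bar{X}_1(G2)^2$$
$$\sum (X_2 - \bar{X}_2(G1))^2 = \sum X_2^2 - n_1 \bar{X}_2(G1)^2$$
$$\sum (X_2 - \bar{X}_2(G2))^2 = \sum X_2^2 - n_2 \bar{X}_2(G2)^2$$
$$\sum (X_1 - \bar{X}_1(G1))(X_2 - \bar{X}_2(G1)) = \sum X_1 X_2 - n_1 \bar{X}_1(G1)\, \bar{X}_2(G1)$$
$$\sum (X_1 - \bar{X}_1(G2))(X_2 - \bar{X}_2(G2)) = \sum X_1 X_2 - n_2 \bar{X}_1(G2)\, \bar{X}_2(G2)$$
(each sum on the right-hand side is taken over the observations of the group concerned)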
4. In each group, find the discriminant score for each combination of the variables X1 and
X2. Then find the average of the discriminant scores of each group and also the grand mean
of the discriminant scores for the entire problem.
5. Find the variability between groups (VBG) using the following formula:
$$VBG = n_1 (\bar{S}_1 - \bar{S})^2 + n_2 (\bar{S}_2 - \bar{S})^2$$
where $\bar{S}_1$ and $\bar{S}_2$ are the means of the discriminant scores in group-1 and group-2,
respectively, and $\bar{S}$ is the aggregate mean of the discriminant scores for the entire problem.
6. Find the variability within groups (VWG) using the following formula:
$$VWG = \sum_{j=1}^{n_1} (S_{1j} - \bar{S}_1)^2 + \sum_{j=1}^{n_2} (S_{2j} - \bar{S}_2)^2$$
where $S_{1j}$ and $S_{2j}$ are the discriminant scores for the jth set of observations in group-1
and group-2, respectively; $\bar{S}_1$ and $\bar{S}_2$ are the means of the discriminant scores of group-1 and
group-2.
Example
The director of a management school wants to carry out a discriminant analysis of the effect
of two factors, namely, the yearly spending (Rs lakh) on infrastructure of the school (X1)
and the yearly spending on interface events of the school (X2), on the grading of the school
by an inspection team. The data are given below:
Table-1
Year      Grade            Expenditure on     Expenditure on
                           infrastructure     interface events
                           (Rs lakh) X1       (Rs lakh) X2
1993-94   Below average     3                 4
94-95     Below average     4                 5
95-96     Above average    10                 7
96-97     Below average     5                 4
97-98     Below average     6                 6
98-99     Above average    11                 4
99-00     Below average     7                 4
00-01     Above average    12                 5
01-02     Below average     8                 7
02-03     Below average     9                 5
03-04     Above average    13                 6
04-05     Above average    14                 8
Below average = 0 and above average = 1
Calculation process
Table-2
Year      Grade        Expenditure on     Expenditure on
          (Group-1)    infrastructure     interface events
                       (Rs lakh) X1       (Rs lakh) X2
1993-94   0             3                 4
94-95     0             4                 5
96-97     0             5                 4
97-98     0             6                 6
99-00     0             7                 4
01-02     0             8                 7
02-03     0             9                 5
Total                  42                35
Mean                    6                 5
Table-3
Year      Grade        Expenditure on     Expenditure on
          (Group-2)    infrastructure     interface events
                       (Rs lakh) X1       (Rs lakh) X2
95-96     1            10                 7
98-99     1            11                 4
00-01     1            12                 5
03-04     1            13                 6
04-05     1            14                 8
Total                  60                30
Mean                   12                 6
Aggregate mean          8.5               5.41666
(G-1 + G-2)
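Table-4
Year      Grade        X1    X2    X1^2   X2^2   X1X2
          (Group-1)
1993-94   0             3     4      9     16     12
94-95     0             4     5     16     25     20
96-97     0             5     4     25     16     20
97-98     0             6     6     36     36     36
99-00     0             7     4     49     16     28
01-02     0             8     7     64     49     56
02-03     0             9     5     81     25     45
Total                  42    35    280    183    217
Table-5
Year      Grade        X1    X2    X1^2   X2^2   X1X2
          (Group-2)
95-96     1            10     7    100     49     70
98-99     1            11     4    121     16     44
00-01     1            12     5    144     25     60
03-04     1            13     6    169     36     78
04-05     1            14     8    196     64    112
Total                  60    30    730    190    364
Table-6
Sum of squares                          Below   Above   Total
Σ(X1 – X̄1)^2                              28      10      38
Σ(X2 – X̄2)^2                               8      10      18
Σ(X1 – X̄1)(X2 – X̄2)                        7       4      11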
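Discriminant Function
Di = d1X1 + d2X2
Normal Equations
38 d1 + 11 d2 = 12 – 6 = 6
11 d1 + 18 d2 = 6 – 5 = 1
Solving these two equations gives d1 = 0.17229 and d2 = –0.04973, so that
Di = 0.17229X1 – 0.04973X2
Mean score for G-1 = 0.17229 x 6 – 0.04973 x 5
= 0.78509
Mean score for G-2 = 0.17229 x 12 – 0.04973 x 6
= 1.7691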
Group-1                                         Group-2
Year      Discriminant   (S1j – S1)^2           Year     Discriminant   (S2j – S2)^2
          score (S1j)                                    score (S2j)
1993-94   0.31795        0.218220               95-96    1.37479        0.155480
94-95     0.44051        0.118755               98-99    1.69627        0.005304
96-97     0.66253        0.015021               00-01    1.81883        0.002473
97-98     0.73536        0.002473               03-04    1.94139        0.029684
99-00     1.00711        0.049293               04-05    2.01422        0.060084
01-02     1.03021        0.060084
02-03     1.30196        0.267155
Total     5.49563        0.730981                        8.84550        0.253025
Mean      0.78509 (G-1)                                  1.7691 (G-2)
Mean (Aggregate)   1.195094
Discriminant Ratio
K = VBG / VWG
= 2.824137 / 0.984006 = 2.87
This is the maximum possible ratio between ‘the variability between groups’ and ‘the
variability within groups’. In the discriminant function, the coefficient of X2 has a negative sign
and is smaller in magnitude than the coefficient of X1, which indicates that the variable X1
(spending on infrastructure) is more important than the variable X2 (spending on interface events).
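The worked example can be reproduced in a few lines. The sketch below uses the data of Table-1, forms the pooled within-group sums of squares and cross-products, solves the two normal equations and computes the discriminant ratio. NumPy is assumed to be available, and the variable names are illustrative.

```python
import numpy as np

# Columns: X1 (infrastructure), X2 (interface events), from Table-1.
g1 = np.array([[3, 4], [4, 5], [5, 4], [6, 6], [7, 4], [8, 7], [9, 5]], dtype=float)  # below average
g2 = np.array([[10, 7], [11, 4], [12, 5], [13, 6], [14, 8]], dtype=float)             # above average

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)

# Pooled within-group sums of squares and cross-products (left-hand side
# of the normal equations).
W = (g1 - m1).T @ (g1 - m1) + (g2 - m2).T @ (g2 - m2)

# Normal equations: W d = difference of the group means.
d = np.linalg.solve(W, m2 - m1)

# Discriminant scores and the two variability measures.
s1, s2 = g1 @ d, g2 @ d
grand_mean = np.concatenate([s1, s2]).mean()
vbg = len(s1) * (s1.mean() - grand_mean) ** 2 + len(s2) * (s2.mean() - grand_mean) ** 2
vwg = ((s1 - s1.mean()) ** 2).sum() + ((s2 - s2.mean()) ** 2).sum()
k_ratio = vbg / vwg

print("W =", W.tolist())
print("d =", np.round(d, 5))
print("VBG = %.4f, VWG = %.4f, K = %.2f" % (vbg, vwg, k_ratio))
```

The small differences from the hand calculation in the later decimals arise because the text rounds the coefficients to five places before computing the scores.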
FACTOR ANALYSIS
Introduction
Factor analysis can simultaneously manage over a hundred variables, compensate for
random error and invalidity, and disentangle complex interrelationships into their major and
distinct regularities. It takes thousands of measurements and qualitative observations and
resolves them into distinct patterns of occurrence. It makes explicit and more precise the
building of fact-linkages going on continuously in the human mind. It is a means by which
the regularity and order in phenomena can be discerned.
Factor analysis assumes that the observed variables are linear combinations of some
underlying (hypothetical or unobservable) factors. Some of these factors are assumed to be
common to two or more variables and some are assumed to be unique to each variable. The
unique factors are assumed to be orthogonal to each other. They do not contribute to the co-
variation among the observed variables (Kim & Mueller, 1983: 8). As in other multivariate
analysis, in factor analysis too, we are concerned with the variance. We want to know how
big it is and where it is. The purpose of this technique is to examine which variables have
what amount of variance in common.
Many statistical methods are used to study the relation between independent and
dependent variables. Factor analysis is different; it is used to study the patterns of
relationship among many dependent variables, with the goal of discovering something about
the nature of the independent variables that affect them, even though those independent
variables were not measured directly. Thus answers obtained by factor analysis are
necessarily more hypothetical and tentative than is true when independent variables are
observed directly.
A factor analysis usually begins with a correlation matrix. It can also use co-
variances. Without getting deeply into the mathematics, we can say that factor analysis
attempts to express each variable as the sum of common and unique portions. The common
portions of all the variables are by definition fully explained by the common factors, and the
unique portions are ideally perfectly uncorrelated with each other. The degree to which a
given data set fits this condition can be judged from an analysis of what is usually called the
"residual correlation matrix".
A typical factor analysis suggests answers to four major questions:
1. How many different factors are needed to explain the pattern of relationships
among these variables?
2. What is the nature of those factors?
3. How well do the hypothesized factors explain the observed data?
4. How much purely random or unique variance does each observed variable include?
Uses of Factor Analysis
The main applications of factor analytic techniques are: (1) to reduce the number of
variables and (2) to detect structure in the relationships between variables, that is to classify
variables. Therefore, factor analysis is applied as a data reduction or structure detection
method. If a scientist has a table of data and he suspects that these data are interrelated in a
complex fashion, then factor analysis may be used to untangle the linear relationships into
their separate patterns.
It can also be used to group interdependent variables into descriptive categories, such
as ideology, revolution, liberal voting, and authoritarianism.
The technique can be used to transform data to meet the assumptions of other
techniques. For instance, application of the multiple regression technique assumes that
predictors are statistically unrelated. If the predictor variables are correlated in violation of
the assumption, factor analysis can be employed to reduce them to a smaller set of
uncorrelated factor scores. The scores may be used in the regression analysis in place of the
original variables, with the knowledge that the meaningful variation in the original data has
not been lost.
1. The first step in factor analysis is the preparation of the data matrix, which has two modes:
the entity mode, which represents cases (observations) arranged as rows, and the variable
mode, which represents the variables arranged as columns. After the data matrix, the correlation
matrix of the variables is prepared.
2. The second step in this analysis is the extraction of common factors that can adequately
explain the observed correlation among the variables. There are several methods of
extraction such as: Maximum Likelihood, Least Square, Alpha Factoring, Image
Factoring, and Principal Component Analysis. The main purpose of extraction is to
know whether a small number of factors can account for the correlation among a much
larger number of variables.
3. There are several criteria to determine the number of initial factors to be extracted by
PC. Notable among them are the scree test and the eigenvalue greater than or equal to one
criterion suggested by Kaiser (1960). In the present analysis, we have applied both
criteria. Both methods provide the same number of factors.
4. The initially extracted factors are rarely interpretable. In order to get the meaningful
results from the initially extracted common factors, the next step is the rotation of these
factors. The purpose of rotation is to achieve the simplest possible factor structure.
The method of rotation cannot improve the degree of fit between the data and the factor
structure; it only makes the results interpretable. There are several methods of rotation. In
orthogonal rotation, three methods: Quartimax, Varimax, and Equimax, are applied,
while in oblique rotation, two methods: Reference Axes, and Primary pattern matrix, are
used. According to Harman (1968), the varimax solution seems to be the “best”
parsimonious analytical solution.
5. Lastly, for interpretation and analysis of factors, variables with the highest factor loadings
(weights) are taken into account.
Note that as we extract consecutive factors, they account for less and less
variability. The decision of when to stop extracting factors basically depends on when
there is only very little "random" variability left. Kaiser’s criterion of eigenvalues greater
than 1, can be adopted for identification of factors. This criterion, proposed by Kaiser
(1960) is probably the one most widely used. Another method is the scree test first
proposed by Cattell (1966). We can plot the eigenvalues shown above in a simple line
plot.
Cattell suggests finding the place where the smooth decrease of eigenvalues
appears to level off to the right of the plot. According to this criterion, we would
probably retain 2 or 3 factors in our example.
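Kaiser's criterion is straightforward to apply once the eigenvalues of the correlation matrix are available. The sketch below uses a hypothetical 4 x 4 correlation matrix, not the data of the example in the text, and assumes NumPy is available. The sorted eigenvalues are also the values one would plot for a scree test.

```python
import numpy as np

# Hypothetical correlation matrix: variables 1-2 and 3-4 form two clusters.
R = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
])

# Eigenvalues of a symmetric matrix, sorted from largest to smallest.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser's criterion: retain factors with eigenvalue greater than 1.
n_factors = int((eigenvalues > 1.0).sum())

# Share of total variance per factor (eigenvalue / number of variables).
explained = eigenvalues / R.shape[0]

print("eigenvalues:", np.round(eigenvalues, 3))
print("factors retained:", n_factors)
print("cumulative variance explained:", np.round(np.cumsum(explained), 3))
```

Consecutive eigenvalues account for less and less variability, and their sum equals the number of variables because each standardized variable contributes a variance of one.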
2. Common variance: variance in a variable shared with common factors. Factor analysis
assumes that a variable's variance is composed of three components: common, specific and
error.
5. Eigenvalue: the variance in a set of variables explained by a factor or component, and
denoted by lambda. Eigenvalue is the sum of squared values in the column of a factor matrix.
7. Factor scores: linear combinations of variables that are used to estimate the cases'
scores on the factors or components. Least squares estimates of factor scores are the most
commonly used.
8. Parsimony principle: When two or more theories explain the data equally well, select
the simplest theory. Factor analysis application: If a two-factor and a three-factor model
explain about the same amount of variance, interpret the two-factor model.
9. Unique variance: that variance of a variable that is not explained by common factors.
Unique variance is composed of specific and error variance.
10. Varimax rotation: an orthogonal rotation criterion which maximizes the variance of
the squared elements in the columns of a factor matrix. Varimax is the most common
rotational criterion.
The values in this table are correlation coefficients between the factor and the variable. For
instance, 0.70 is the r between variable A and Factor I. These correlations are called
loadings. Eigenvalues are the sums of the squared factor loadings. For example, the eigenvalue
for factor I is $0.70^2 + 0.60^2 + 0.50^2 + 0.60^2 + 0.60^2$. When divided by the number of variables, an
eigenvalue yields an estimate of the proportion of total variance explained by the factor.
Communalities ($h^2$) measure the variance in each variable that is explained by the two
factors. A communality is the sum of squares of the factor loadings of all the factors for a variable. For instance,
for variable A, the communality is $0.70^2 + (-0.40)^2 = 0.65$, indicating that 65 per cent of the
variance in variable “A” is statistically explained in terms of factor I and factor II.
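The two sums in this paragraph can be checked directly. The sketch uses only the loadings quoted in the text: the five loadings listed for factor I, and the two loadings of variable A. The variable names are illustrative.

```python
# Loadings on factor I as listed in the text.
factor_1_loadings = [0.70, 0.60, 0.50, 0.60, 0.60]

# Eigenvalue of factor I: sum of squared loadings down the factor's column.
eigenvalue_1 = sum(loading ** 2 for loading in factor_1_loadings)

# Communality of variable A: sum of squared loadings across its row
# (0.70 on factor I and -0.40 on factor II).
communality_a = 0.70 ** 2 + (-0.40) ** 2

print(f"eigenvalue of factor I:     {eigenvalue_1:.2f}")
print(f"communality of variable A:  {communality_a:.2f}")
```

Eigenvalues are column sums of the squared loading matrix and communalities are row sums, so both come from the same table of squared loadings.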
Un-rotated factor loadings do not provide meaningful results. They are difficult to
interpret. What one would like to find is some pattern in which factor I would be heavily
loaded on some variables and factor II on others. Such a condition would suggest rather
“pure” constructs underlying each factor. You attempt to secure this less ambiguous
condition between factors and variables by rotation. This procedure can be conducted
through orthogonal method. Rotated factor loadings are given in the table. This shows that
the measurement from six variables may be summarized by two underlying factors.
The interpretation of factor loadings is largely subjective. There is no way to
calculate the meanings of factors; they are what one sees in them. For this reason, factor
analysis is largely used for exploration. One can detect patterns in latent variables, discover
new concepts and reduce data.
MANOVA
Analysis of variance is a special case of regression model, which is generally used to analyse
data collected using experimentation. Multivariate analysis of variance (MANOVA)
examines the relationship between several dependent and independent variables. Whereas
ANOVA assesses the differences between groups on a single dependent variable, MANOVA examines the dependence
relationship between a set of variables across a set of groups. It is a technique which
determines the effects of independent categorical variables on multiple continuous
dependent variables. It is usually used to compare several groups with respect to multiple
continuous variables. The main distinction between MANOVA and ANOVA is that several
dependent variables are considered in MANOVA.
Classification of MANOVA
Assumptions of MANOVA
1. Normal Distribution
The dependent variables should be normally distributed within groups. Overall, the F test
is robust to non-normality, if the non-normality is caused by skewness rather than by
outliers. Tests for outliers should be run before performing a MANOVA, and outliers
should be transformed or removed.
2. Linearity
It assumes that there are linear relationships among all pairs of dependent variables, all
pairs of covariates, and all dependent variable-covariate pairs in each cell.
3. Homogeneity of Variances
Homogeneity of variances assumes that the dependent variables exhibit equal levels of
variance across the range of predictor variables. Homoscedasticity can be examined
graphically or by means of a number of statistical tests.
When correlations among dependent variables are high, problems of multicollinearity and
singularity exist. Multicollinearity arises when the correlation between a pair of variables is
high (r > .90). Singularity arises when a variable is redundant, that is, when it is a combination of two or more of
the other variables.
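A quick screen for multicollinearity among the dependent variables is to inspect their pairwise correlations against the r > .90 threshold. The sketch below assumes NumPy is available. The function `flag_multicollinearity`, the variable names and the six hypothetical observations are illustrative only and are not the data of the example that follows.

```python
import numpy as np


def flag_multicollinearity(data, names, threshold=0.90):
    """Return (name_i, name_j, r) for every pair with |r| above the threshold.

    data has one row per case and one column per dependent variable.
    """
    r = np.corrcoef(data, rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                flagged.append((names[i], names[j], float(r[i, j])))
    return flagged


# Hypothetical scores on three dependent variables for six respondents.
names = ["attitudes", "feelings", "exposure"]
data = np.array([
    [1.0, 1.1, 3.0],
    [2.0, 2.0, 1.0],
    [3.0, 2.9, 4.0],
    [4.0, 4.2, 1.0],
    [5.0, 5.1, 5.0],
    [6.0, 5.9, 2.0],
])

pairs = flag_multicollinearity(data, names)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.3f}")
```

A flagged pair suggests that one of the two variables adds little independent information and could be dropped or combined before running the analysis.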
Example:
A social scientist wished to compare those respondents who had lodged an organ donor card
with those who had not. Three hundred and eighty eight new drivers completed a
questionnaire that measured their attitudes towards organ donation, their feelings about
organ donation and their previous exposure to the issue. It is hypothesized that individuals
who agreed to be donors would have more positive attitudes towards organ donation, more
positive feelings towards organ donation and greater previous exposure to the issues.
Therefore, the independent variable was whether a donor card had been signed and the
dependent variables were attitudes towards organ donation, feelings towards organ donation
and previous exposure to organ donation. Attitudes and feelings were measured on traditional
scales with a Likert scale response format. Exposure was measured in terms of media
exposure and personal experience. Conceptually and theoretically these dependent variables
were believed to be related and so MANOVA was the analysis of choice. Complete data are
available on www.johmwiley.com.au/highered/spssv
Results
Between-Subjects Factors
Value Label N
2 no 188
Descriptive Statistics
signed
donor
card Mean Std. Deviation N
Multivariate Tests
a. Exact statistic
Box's Test of Equality of Covariance Matrices
Box's M 19.260
F 3.182
df1 6
df2 1018790.282
Sig. .004
Tests the null hypothesis that the error variance of the dependent variable is
equal across groups.
donor   exposure to donation issues   903.925   1   903.925   3.830   .051
Box’s M tests the homogeneity of the variance-covariance matrices. Because this test is not
significant at an alpha level of 0.001 (Sig. = .004), the homogeneity assumption is treated as met.
The multivariate tests of significance test whether there are significant group
differences on a linear combination of the dependent variables. Several test
statistics are available. Pillai’s Trace Criterion is considered to have acceptable power and
to be the most robust statistic against violation of assumptions. Having obtained a significant
multivariate effect for donor, i.e., a significance of F less than 0.05, an examination of the
univariate F-test for each variable indicates which individual dependent variables contribute
to the significant multivariate effect.
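To make the multivariate statistics less abstract, here is a minimal Python sketch that computes Wilks' lambda and Pillai's trace by hand for groups measured on two dependent variables. It is an illustration under stated assumptions, not the SPSS procedure used for the output above: the function names and the eight data points are hypothetical, the code is restricted to exactly two dependent variables, and it stops short of converting the statistics to an F value or a significance level.

```python
def mean_vector(rows):
    """Column means of a list of equal-length numeric tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def sscp(rows, centre):
    """Sums-of-squares-and-cross-products matrix of rows about a centre."""
    p = len(centre)
    m = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - centre[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                m[i][j] += d[i] * d[j]
    return m


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def manova_2dv(groups):
    """Return (Wilks' lambda, Pillai's trace) for groups with two DVs each."""
    all_rows = [r for g in groups for r in g]
    total = sscp(all_rows, mean_vector(all_rows))      # T = H + E
    error = [[0.0, 0.0], [0.0, 0.0]]                   # E: pooled within-group SSCP
    for g in groups:
        w = sscp(g, mean_vector(g))
        for i in range(2):
            for j in range(2):
                error[i][j] += w[i][j]
    hyp = [[total[i][j] - error[i][j] for j in range(2)] for i in range(2)]  # H
    det_t = det2(total)
    wilks = det2(error) / det_t                        # |E| / |H + E|
    inv_t = [[total[1][1] / det_t, -total[0][1] / det_t],
             [-total[1][0] / det_t, total[0][0] / det_t]]
    # Pillai's trace = trace(H (H + E)^-1)
    pillai = sum(hyp[i][k] * inv_t[k][i] for i in range(2) for k in range(2))
    return wilks, pillai


# Hypothetical (attitude, feelings) scores for two groups of four respondents.
signed = [(4, 5), (5, 4), (6, 6), (5, 5)]
not_signed = [(2, 3), (3, 2), (4, 4), (3, 3)]
wilks, pillai = manova_2dv([signed, not_signed])
print(round(wilks, 4), round(pillai, 4))
```

With only two groups there is a single discriminant dimension, so the two statistics carry the same information and sum to 1. A small Wilks' lambda (or a large Pillai's trace) indicates that the groups differ on the combined dependent variables.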
13. DATA INTERPRETATION AND REPORT WRITING
2. The quality of the research may be judged directly by the quality of the writing and
how well you convey the importance of your findings.
3. If you are submitting a research report for a class or to an organization, check for
specific requirements and guidelines before beginning to write your research
report.
Types of Report
• Scientific/lab
• Technical
• Business
• Research
• Academic overview
Writing & Editing Your Report: Writing the first draft for yourself
Where do you start writing: the introduction or elsewhere?
• Reports are rarely written in linear order. The order for writing the final sections
may be Conclusions, Introduction and finally the Abstract. These are the sections
most likely to be read by readers.
• For every 1000 readers who see your title, 100 may read the abstract, perhaps 10
will read some of the main report [conclusions, some results etc], at most 1 may
follow all the way through
• A middle section such as the methods/ system design may be a good starting point
• Writing notes for the introduction, some background theory or a review of previous
studies may help you to clarify the focus of your report.
Writing specifications
Use 10 or 12 point font
• The most acceptable fonts: Arial, Times New Roman (the old reliable), Verdana,
Lucida
Unacceptable fonts: Broadway, Brush Script, Chiller, Courier, Freestyle Script, Gigi, Old
English Text, Playbill, etc.
• The relative importance of the sections in the report, and the relatedness of
information within sections.
It therefore plays a very important role in communicating meaning to the reader. The
report presents meaning and information in two complementary and equivalent ways:
- The meaning represented by the words, thought, research, information
- The meaning represented by the layout
Once a system is chosen, the writer must present this system consistently throughout the
report.
• 1.0
• 1.1
• 1.2
1.2.1
1.2.2
1.2.2.1
1.2.2.2
• 2.0
Number - letter
(still encountered, but becoming less commonly used)
First level (of importance/generality)
(A heading) I II III IV V VI VII
Second level
(B heading) A B C D E F G
Third level
(C heading) 1 2 3 4 5 6 7
Fourth level
(D heading) (a) (b) (c) (d) (e) (f) (g)
Fifth level
(E heading) (i) (ii) (iii) (iv) (v) (vi) (vii)
• II
– A
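The five-level number-letter scheme above is mechanical enough to generate automatically. The following Python sketch is one hypothetical way to do it: the function names are illustrative, and the letter levels are assumed to run only from a to z.

```python
def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    symbols = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
               (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
               (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in symbols:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def heading_label(level, index):
    """Label for the index-th heading (1-based) at a level from 1 to 5."""
    letter = chr(ord("a") + index - 1)   # assumes index is between 1 and 26
    if level == 1:
        return to_roman(index)                 # I, II, III, ...
    if level == 2:
        return letter.upper()                  # A, B, C, ...
    if level == 3:
        return str(index)                      # 1, 2, 3, ...
    if level == 4:
        return f"({letter})"                   # (a), (b), (c), ...
    if level == 5:
        return f"({to_roman(index).lower()})"  # (i), (ii), (iii), ...
    raise ValueError("the number-letter system defines five levels")


labels = [heading_label(level, 2) for level in range(1, 6)]
print(labels)
```

Generating labels from a single function is one way to keep the chosen system consistent throughout the report.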
General guidelines:
- There are four main pieces of information that have to be included on the title page:
- The report title;
- The name of the person, company, or organization for whom the report has been
prepared;
- The name of the author and the company or university which originated the report;
- The date the report was completed.
- A title page might also include a contract number, a security classification, or a copy
number, depending on the nature of the report you are writing.
Table of contents
• Your report should include a table of contents if longer than about 5-10 pages.
Beginning an introduction
• Also help to place your project in its context (whether that context is background
information or your purpose in writing is up to you).
Consider the following examples; they represent two extremes that writers can take in
beginning their introductions.
• The universe has been expanding from the very moment that it was born.
One of the ways that the sentence above might be rewritten is:
Recent studies suggest that the universe will continue expanding forever and may pick
up speed over time.
The rewritten sentence establishes the report’s context within “recent studies” concerning
a specific theory related to universe expansion. This context is much more specific than
that of the original sentence.
In general, the body of the research report will include three distinct sections:
• A section on theories, models, and your own hypothesis
• A section in which you discuss the materials and methods you used in your
research
• A section in which you present and interpret the results of your research
• The headings should be self-explanatory. The main body of the report needs to be
clear, concise and follow a logical order.
• Figures and tables must be referred to in the body of the text and need to have clear
captions. Label figures at the bottom and tables at the top in numerical order.
• Each figure should be capable of being understood on its own using the caption as
the only reference.
• Research articles,
• Formal reports, or
• scientific papers.
The hypothesis that you propose and on which you base your research.
When you develop hypotheses, you predict what you will find after you conduct your
research. This prediction is based on existing theories, models, evidence, and logic.
• Define and explain your hypothesis and the theories and models you used to
develop it.
• Define and explain competing hypotheses, theories, and models, including their
strengths and weaknesses.
• Compare and contrast the specific points where they agree or disagree.
Identify
Theoretical framework
Variables
The following questions are good ones to work through: What do I expect this
experiment to reveal? Why?
• How does my hypothesis directly answer the question posed by the problem?
• How does the hypothesis fit in with other hypotheses or more general theory? How
will my work challenge or support the work of others?
• What are alternative views to this theory? What are the strengths and weaknesses of
those views?
• On what literature did I or can I base my explanation?
All materials and methods sections should address the following questions:
• What sequence of events did you follow as you handled the subjects/materials or as
you recorded data?
Results: Presenting data
All preceding sections of the report (Introduction, Materials and Methods, etc.) lead in
to the Results section of the report and all subsequent sections will consider what the
results mean (conclusion, recommendations, etc.).
Consider using figures and tables when you need to decipher information or the analysis
of information, when you need to describe relationships among data that are not apparent
otherwise, and when you need to communicate purely visual aspects of a phenomenon or
apparatus.
Tables or lists are simple ways to organize the precise data points themselves in one-on-
one relationships.
A graph is best at showing the trend or relationship between two dimensions, or the
distribution of data points in a certain dimension (i.e., time, space, across studies,
statistically).
A pie chart is best at showing the relative areas, volumes, or amounts into which a whole
(100%) has been divided.
Flow charts show the organization or relationships between discrete parts of a system. For
that reason they are often used in computer programming.
• The most important general rule is that tables and figures should supplement rather
than simply repeat information in the report.
• You should never include a table or figure simply to include them. This is
redundant and wastes your reader’s time.
• Include a concise title—it is a good idea to make the most important feature of the
data the title of the figure.
• Use legends and clear, concise, descriptive titles for tables and figures.
• Ensure all axes of graphs are labeled and that units are identified in all
tables and figures.
Results & Discussion: Interpretation of Data
This section of the report is important because it demonstrates the meaning of your
research.
This section of the paper draws upon writing skills that other sections do not because you
need to write persuasively in this section as you convince readers that your interpretation
of data is logical and correct.
As you develop your argument in this section, consider arranging your evidence in the
order that best highlights your main point, cite authorities that have come to similar
interpretations under similar circumstances, and consider the superiority of your
conclusions to opposing viewpoints.
For most research reports, the most certain part of your case will be your data, and
many research reports will develop along this outline:
Consider how the data address the research problem or hypothesis outlined in the
Introduction.
Proceed from most general features of the data to more specific results
Discuss what can be inferred from the data as they relate to other research and
scientific concepts
Compare with other studies and draw conclusions based on your findings.
your results are inadequate, negative, or not consistent with earlier studies or with
your own hypothesis.
Do not try to defend your research or minimize the seriousness of the limitation in
your interpretation; instead, focus on the limitation only as it affects the research
and try to account for it.
• The associations between birth weight and cognitive function at ages 8, 11, and 15
are evident across the normal birthweight range (>2.5 kg) and so are not accounted
for exclusively by low birth weight
• Birth weight is also associated with educational attainment, suggesting that the
association between birth weight and cognition may have functional implications
Conclusions
• The conclusions you draw are opinions, based on the evidence presented in the
body of your report, but because they are opinions you should not tell the reader
what to do or what action they should take.
Be sure that you use language that distinguishes conclusions from inferences.
Use phrases like “This research demonstrates . . .” to present your conclusions and phrases
like “This research suggests . . .” or “This research implies . . .” to discuss implications.
Make sure that readers can tell your conclusions from the implications of those
conclusions, and do not claim too much for your research in discussing implications. You
can use phrases such as “Under the following circumstances,” “In most instances,” or “In
these specific cases” to warn readers that they should not generalize your conclusions.
You might also raise unanswered questions and discuss ambiguous data in your
conclusion.
Raising questions or discussing ambiguous data does not mean that your own work is
incomplete or faulty; rather, it connects your research to the larger work of science and
parallels the introduction in which you also raised questions.
The following is an example taken from a text that evaluated the hearing and speech
development following the implantation of a cochlear implant. The authors of “Beginning
To Talk At 20 Months: Early Vocal Development In a Young Cochlear Implant Recipient,”
published in Journal of Speech, Language, and Hearing Research, titled their conclusion
“Summary and Caution.” Using this title calls readers’ attention to the limitations of their
research:
Recommendations
This section appears in a report when the results and conclusions indicate that further work
needs to be done or when you have considered several ways to resolve a problem or improve
a situation and want to determine which one is best.
This gives you another opportunity to demonstrate how your research fits within the larger
project of science.
It also demonstrates that you fully understand the importance and implications of your
research, as you suggest ways that it could continue to be developed.
References
• Like the sections on the procedure you used to gather data, they allow other
researchers to build on or to duplicate your research.
• Without references, readers will not be able to tell whether the information that you
present is credible, and they will not be able to find it for themselves.
• Reference sections also allow you to refer to other researchers’ work without
reviewing that work in detail. You can refer readers to your reference page for more
information.
It is best to compile your own reference list containing a variety of information. This will
save you from having to track down pieces of information you may have neglected to make
note of if they are specifically requested after you have filed a source, returned it to the
library, or misplaced it.
The reference list includes only references cited in the text. The author's surname is
placed first, immediately followed by the year of publication. This date is often
placed in brackets.
The title of the publication appears after the date followed by place of publication,
then publisher (some sources say publisher first, then place of publication).
The important thing is to check for any special requirements or, if there are none, to
be consistent.
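The ordering rules above can be sketched as a small Python formatter. This is a hedged illustration only: the function names, the punctuation between elements, the inclusion of initials and the alphabetical sorting are assumptions rather than requirements stated here, and the two sources are placeholders, not real publications.

```python
def harvard_entry(surname, initials, year, title, place, publisher):
    """Author-date entry: surname, year in brackets, title, place, publisher."""
    return f"{surname}, {initials} ({year}) {title}. {place}: {publisher}."


def build_reference_list(sources, cited_keys):
    """Keep only the sources cited in the text and format each one."""
    cited = [s for key, s in sources.items() if key in cited_keys]
    # Sorting by surname, then year, is an assumed convention.
    cited.sort(key=lambda s: (s["surname"], s["year"]))
    return [harvard_entry(**s) for s in cited]


# Placeholder records (not real publications).
sources = {
    "src1": {"surname": "Surname", "initials": "A.", "year": 2001,
             "title": "Placeholder Book Title", "place": "City",
             "publisher": "Publisher"},
    "src2": {"surname": "Author", "initials": "B.", "year": 1999,
             "title": "Another Placeholder Title", "place": "Town",
             "publisher": "Press"},
}
refs = build_reference_list(sources, cited_keys={"src1"})
print(refs)
```

Only the cited source appears in the output. Swapping the order of place and publisher is a one-line change in `harvard_entry`, which is the kind of consistency the rule above asks for.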
1. The Harvard (author-date) system is the one usually encountered in the sciences and
social sciences.
2. Notice that the titles of books, journals and other major works appear in italics (or
are underlined when handwritten), while the titles of articles and smaller works
which are found in larger works are placed in (usually single) quotation marks.
Quotations
When the exact words of a writer are quoted, they must be reproduced exactly in
all respects: wording, spelling, punctuation, capitalisation and paragraphing.
Quotations should be carefully selected and sparingly used, as too many quotations
can lead to a poorly integrated argument.
--the original words are so concisely and convincingly expressed that they
cannot be improved upon
EXAMPLE: The Style Manual (1978, p. 46) states that 'the modern tendency to use single
quotation marks rather than double is recommended.'
Long quotations (more than thirty words):
Indent the quotation from the remainder of the text.
Some writers recommend the use of smaller type or italics to set off indented quotations.
Introduce the quotation appropriately, and cite the source at the end of the quotation as
you would in your text
Ellipsis
Irrelevancies within very long quotations can be omitted by the use of an ellipsis which is
indicated by three spaced dots (. . .).
Nowadays it is not usual to place an ellipsis at the beginning or the end of a quotation
which is intended to stand alone or forms part of one of your own sentences.
Appendices
o You should place information in an Appendix that is relevant to your subject but
needs to be kept separate from the main body of the report to avoid interrupting the
line of development of the report.
o An appendix should include only one set of data, but additional appendices are
acceptable if you need to include several sets of data that do not belong in the same
appendix.
o Do not place the appendices in order of their importance to you, but rather in the
order in which you referred to them in your report. You should also paginate each
appendix separately so that the first page of each appendix you include begins with
1.
o A good general rule to follow is to define all terms that you are not completely sure
your audience will understand the same way you do.
o Words to focus on are those key to your research, those relatively new or unfamiliar,
and those that readers could not look up for themselves in a standard dictionary.
Jargon
o You should take your audience into consideration when determining when to include
jargon in your writing.
o Consider their vocabulary and whether they will be familiar with a word or phrase
before you use it.
o Do not simply choose to include jargon without taking your audience into
consideration. Jargon can come between your writing and your reader, and readers
who do not understand jargon may see your use of jargon as impolite.
Writing Numbers
1) Spell out numbers between zero and ten and use figures for all other numbers.
Examples: two cats, 11 materials, one attempt, 20,000 residents
Unfortunately, there are a number of exceptions to this general guideline. Make sure that
you are as familiar with the exceptions as you are with the rule.
Exceptions:
Mathematical operations
Raised to the power of 4
Units of measurement
6 feet
Age
9 years old
Time
1 pm
Dates
June 8, 2001
Page numbers
Page 4
Percentages
2 percent
Money
$5
Proportions
100:1
All numbers that begin a sentence should be spelled out.
Seven times the tests failed.
2) When you use two or more numbers in the same section of writing, use figures. This
makes them easier to see and compare.
Exception:
If none of the numbers included is larger than 10, then spell out all of the numbers.
We are requesting funding to purchase nine pumps, six fans, and three ducts.
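Rules 1 and 2 can be expressed as a short Python sketch. The function names are hypothetical, the code handles whole numbers only, and it deliberately ignores the listed exceptions (units, ages, dates, money and so on) as well as the rule about numbers that begin a sentence.

```python
WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]


def write_number(n):
    """Rule 1: spell out zero to ten, use figures for everything else."""
    if 0 <= n <= 10:
        return WORDS[n]
    return f"{n:,}"          # thousands separator, e.g. 20,000


def write_numbers_together(numbers):
    """Rule 2: numbers used together take figures, unless none exceeds 10."""
    if all(0 <= n <= 10 for n in numbers):
        return [WORDS[n] for n in numbers]
    return [f"{n:,}" for n in numbers]


print(write_number(2), write_number(11), write_number(20000))
print(write_numbers_together([9, 6, 3]))
print(write_numbers_together([9, 12]))
```

The second call reproduces the pumps, fans and ducts example, while the third shows that a single number above ten switches the whole group to figures.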
Sequence or range of values
Pages 167-170
Pages 224-35
Number and unit of measure used to modify a noun
20-pound dog
20-ounce pitcher
5) Use decimals instead of fractions, whenever possible. Decimals are easier to type and
to read. Write both decimals and fractions as figures.
6) A zero is always placed before the decimal point for numbers less than one.
7) Spell out the shorter of two numbers that appear consecutively in a phrase.
Examples:
Not: 4 6-inch nails             But: 4 six-inch nails
Not: 20 1,000-piece puzzles     But: Twenty 1,000-piece puzzles
Writing Measurements
9) Separate the figure from the name of the measure with a space, but do not separate % or
$ from the figure with a space.
Examples: 3.4 hr $22 50%
11) Use figures for years and decades and don’t abbreviate them.
Not: ’30s           But: The 1930s
Not: The fifties    But: The 1950s
Writing Equations
12) Place equations on a separate line and number them consecutively with a number in
parentheses at the right margin.
13) Do not use punctuation after the equation, but punctuate words to introduce equations
as you would words forming any other sentence.
14) Refer to an equation in the body of the text by its number in parentheses.
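Rules 12 to 14 can be illustrated with a short LaTeX fragment. This is a sketch that assumes the report is typeset in LaTeX with the amsmath package loaded; the label name is hypothetical, and the F ratio is used only as a familiar example.

```latex
The univariate test statistic is the ratio of the two mean squares
\begin{equation}
  F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  \label{eq:f-ratio}
\end{equation}
where each mean square is a sum of squares divided by its degrees of freedom.
Equation~\eqref{eq:f-ratio} is then compared with the critical value.
```

The `equation` environment places the formula on its own line and numbers it in parentheses at the right margin, no punctuation follows the equation, and `\eqref` cites it in the text by its number in parentheses.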
Murphy's Law of Errata Detection: "The very first person to see your mistake is always
the last person you want to know about it."
When reading your own work, you do not always spot errors, because you may read your draft
the way you want it to sound.
Work with a critical friend: someone who gives honest advice, perhaps from outside your field.
As soon as you sit in front of the paper with your critical friend, your perspective may
change from that of the writer to that of a potential reader of the paper.
Try using the editing questions provided by the Purdue University Online Writing Lab.
To Sum up
Use and evaluate all the data you report and do not be discouraged if your results differ
from published studies or from what you expected
Justify all tables and figures by discussing their content and labeling them clearly
Be creative in your presentation of data, your analysis, and your interpretation of
data - play around with different variations before completing your report
Do not force conclusions from your data or fudge data by omitting that which does
not support pre-conceived conclusions
Make sure all calculations and analyses are relevant to the hypotheses you are
testing and the overall objectives of the study
Justify your ideas and conclusions with data, facts, and background literature and
with sound reasoning
Keep the different sections of the report discrete, i.e. methods in the
methods section, results in the results section, and leave discussion and
interpretation of those results for the discussion section
Plan your writing: organize your thoughts and data, and sketch the report before
actually writing. This will help maximize your time efficiency and lead to a
concise, well structured report