Module 4
Through this process, research stakeholders turn the qualitative and quantitative data from a
research study into a readable format, such as graphs, reports, or other outputs that resonate with
business stakeholders. The process also provides context for the data that has been collected and
helps with strategic business decisions.
While it is a critical aspect of a business, data processing is still an underutilized process in research.
With the proliferation of data and the growing number of research studies conducted, processing the
information and putting it into knowledge management repositories like InsightsHub is critical.
The data processing cycle in research has six steps. Let's look at these steps and why they are an
essential component of the research design.
A simple example of data analysis can be seen whenever we make a decision in our daily lives by
evaluating what has happened in the past or what will happen if we make that decision. Basically, this
is the process of analyzing the past or future and making a decision based on that analysis.
It’s not uncommon to hear the term “big data” brought up in discussions about data analysis. Data
analysis plays a crucial role in processing big data into useful information. Neophyte data analysts who
want to dig deeper by revisiting big data fundamentals should go back to the basic question, “What is
data?”
A huge part of a researcher’s job is to sift through data. That is literally the definition of “research.”
However, today’s Information Age routinely produces a tidal wave of data, enough to overwhelm even
the most dedicated researcher. From a bird's-eye view, data analysis:
1) plays a key role in distilling this information into a more accurate and relevant form, making
it easier for researchers to do their job.
2) provides researchers with a vast selection of different tools, such as descriptive statistics,
inferential analysis, and quantitative analysis.
3) offers researchers better data and better ways to analyze and study said data.
Several popular types of data analysis are commonly employed today in the worlds of technology and
business. They are:
1) Diagnostic Analysis: Diagnostic analysis answers the question, "Why did this happen?" Using
insights gained from statistical analysis (more on that later!), analysts use diagnostic analysis
to identify patterns in data. Ideally, the analysts find similar patterns that existed in the past
and can apply the solutions that worked then to the present challenges.
2) Predictive Analysis: Predictive analysis answers the question, “What is most likely to
happen?” By using patterns found in older data as well as current events, analysts predict
future events. While there’s no such thing as 100 percent accurate forecasting, the odds
improve if the analysts have plenty of detailed information and the discipline to research it
thoroughly.
3) Prescriptive Analysis: Mix all the insights gained from the other data analysis types, and you
have prescriptive analysis. Sometimes, an issue can’t be solved solely with one analysis type,
and instead requires multiple insights.
4) Statistical Analysis: Statistical analysis answers the question, “What happened?” This analysis
covers data collection, analysis, modeling, interpretation, and presentation using dashboards.
The statistical analysis breaks down into two sub-categories:
• Descriptive: Descriptive analysis works with either complete data or selections of summarized
numerical data. It illustrates means and deviations in continuous data, and percentages and
frequencies in categorical data (see the sketch after this list).
• Inferential: Inferential analysis works with samples derived from complete data. An analyst
can arrive at different conclusions from the same comprehensive data set just by choosing
different samples.
5) Text Analysis: Also called "data mining," text analysis uses databases and data mining tools to
discover patterns residing in large datasets. It transforms raw data into useful business
information. Text analysis is arguably the most straightforward and most direct method
of data analysis.
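To make the descriptive sub-category above concrete, here is a minimal Python sketch. The sample values and segment labels are invented purely for illustration; it computes a mean and standard deviation for continuous data, and frequencies and percentages for categorical data.

```python
# Descriptive statistics: means and deviations for continuous data,
# frequencies and percentages for categorical data.
# All sample values below are hypothetical.
from statistics import mean, stdev
from collections import Counter

ages = [34, 29, 41, 37, 45, 29, 52, 38]              # continuous data
segments = ["A", "B", "A", "C", "B", "A", "A", "C"]  # categorical data

print(f"mean age: {mean(ages):.1f}, standard deviation: {stdev(ages):.1f}")

for level, count in Counter(segments).items():
    print(f"segment {level}: {count} ({100 * count / len(segments):.0f}%)")
```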
Although there are many data analysis methods available, they all fall into one of two primary types:
qualitative analysis and quantitative analysis.
Qualitative Data Analysis: The qualitative data analysis method derives data from words, symbols,
pictures, and observations, and it does not rely on statistics. The most common qualitative methods
include interviews, focus groups, and observation.
Quantitative Data Analysis: Statistical data analysis methods collect raw data and process it into
numerical data. Quantitative analysis methods, illustrated in the sketch after this list, include:
• Hypothesis Testing, for assessing whether a given hypothesis or theory holds for a data set or
demographic.
• Mean (or average), which determines a subject's overall trend by dividing the sum of a list of
numbers by the number of items in the list.
• Sample Size Determination, which takes a small sample from a larger group of people and
analyzes it; the results are considered representative of the entire body.
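The sketch below illustrates two of the methods just listed, the mean and a hypothesis test, on invented scores; it assumes Python with SciPy installed, and the claimed population mean of 75 is an arbitrary example.

```python
# Quantitative analysis sketch: a mean and a simple hypothesis test.
# The scores and the claimed mean of 75 are made up for illustration.
from statistics import mean
from scipy import stats

scores = [72, 85, 78, 90, 66, 81, 77, 74, 88, 69]

# Mean: the sum of the list divided by the number of items in the list.
print("sample mean:", mean(scores))

# Hypothesis test: does the population mean differ from the claimed value of 75?
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```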
We can further expand our discussion of data analysis by showing various techniques, broken down
by different concepts and tools.
What is a Parametric Test?
The basic principle behind parametric tests is that we have a fixed set of parameters that
determine a probabilistic model, one that may be used in machine learning as well.
Parametric tests are tests for which we have prior knowledge of the population distribution
(typically normal) or, if not, for which we can approximate it with a normal distribution
with the help of the Central Limit Theorem.
The parameters used for the normal distribution are:
• Mean
• Standard Deviation
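As a small sketch of what "a fixed set of parameters" means in practice, the snippet below fits a normal distribution to an invented sample by estimating exactly those two parameters; it assumes NumPy and SciPy are available.

```python
# Fitting a normal (parametric) model amounts to estimating its two
# parameters: the mean and the standard deviation. Data are hypothetical.
import numpy as np
from scipy import stats

sample = np.array([4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0])

mu_hat, sigma_hat = stats.norm.fit(sample)  # maximum-likelihood estimates
print(f"estimated mean = {mu_hat:.2f}, estimated standard deviation = {sigma_hat:.2f}")
```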
What is a Non-parametric Test?
In non-parametric tests, we don't make any assumptions about the parameters of the given
population or the population we are studying. In fact, these tests don't depend on the
population's distribution.
Hence, there is no fixed set of parameters available, and no distribution (normal or otherwise)
that can be relied on.
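One way to see the difference is to run a parametric and a non-parametric test on the same two-group data; the t-test assumes normally distributed populations, while the Mann-Whitney U test does not. The group values are invented and SciPy is assumed to be installed.

```python
# Comparing two groups with a parametric test (independent-samples t-test)
# and a non-parametric alternative (Mann-Whitney U).
# The group values are invented for illustration.
from scipy import stats

group_a = [12.1, 11.8, 13.0, 12.6, 12.3, 11.9, 12.8]
group_b = [13.4, 12.9, 13.8, 13.1, 13.6, 12.7, 13.9]

t_stat, t_p = stats.ttest_ind(group_a, group_b)      # parametric
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)   # non-parametric

print(f"t-test p-value:       {t_p:.3f}")
print(f"Mann-Whitney p-value: {u_p:.3f}")
```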
Usually we get more power when we can make valid and correct assumptions about our data. There’s
nothing mysterious about this relationship between power and assumptions. For instance, if we are
measuring the mass of an object by weighing it against a certain volume of water, we can be more
confident in our results if we can assume that the water is pure distilled water; if we cannot assume
that the water is distilled — i.e., it might be distilled, or it might be heavily salted, or contain dissolved
solids, any of which could throw off its weight — then we have to make allowances for those
possibilities. Likewise, if we can assume that data is normally distributed we can use a parametric test
like a T-test, Z-test, or Chi-Squared test, which leverage our mathematical understanding of normal
distributions to reduce our uncertainty about the distribution of errors. This means we can reject the
null at smaller values, thus reducing the chance of Type II errors. However, if we cannot validly assume
that our data is normally distributed, we use a non-parametric test instead. Non-parametric tests
make it harder to reject the null hypothesis, creating a larger chance of Type II error, but they make
allowances for different possible distributions that would create Type I errors if we make
assumptions that aren't true.
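A minimal sketch of that decision follows, under the assumption that a Shapiro-Wilk check at the 0.05 level is an acceptable way to screen for normality (a common choice, not the only one); the simulated data and the claimed mean of 48 are invented for illustration.

```python
# Check the normality assumption, then choose a parametric or
# non-parametric test accordingly. The data, the claimed mean of 48,
# and the 0.05 cut-off are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=30)

_, normality_p = stats.shapiro(data)
if normality_p > 0.05:
    # Normality is plausible: use a parametric one-sample t-test.
    _, p = stats.ttest_1samp(data, popmean=48)
    print(f"t-test p-value = {p:.3f}")
else:
    # Normality is doubtful: fall back on a non-parametric
    # Wilcoxon signed-rank test against the claimed value.
    _, p = stats.wilcoxon(data - 48)
    print(f"Wilcoxon p-value = {p:.3f}")
```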
As a rule, Type I error is something that we decide on as a feature of our test design — when we
determine the significance level of our hypothesis test, we are determining exactly how much chance
of falsely rejecting a true null hypothesis we are willing to risk. Type II error is more of an
imponderable, having to do with the validity of the model we adopt and the assumptions we make
about our data. The more unsure we are of our model or our assumptions, the more allowances we
have to make. We would always rather fail to reject a false null hypothesis than reject a true one —
we want to control our chance of making false assertions diligently — and so we go out of our way to
be conservative about Type I errors.
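The simulation below is a rough sketch of this idea: when the null hypothesis really is true, the significance level we chose is approximately the fraction of tests that falsely reject it. The sample sizes, number of trials, and random seed are arbitrary choices.

```python
# When H0 is actually true, the chosen significance level (alpha) is
# roughly the rate of Type I errors (false rejections).
# All simulation settings are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_trials, false_rejections = 0.05, 5000, 0

for _ in range(n_trials):
    # Both samples come from the SAME distribution, so H0 (no difference) is true.
    a = rng.normal(0.0, 1.0, size=25)
    b = rng.normal(0.0, 1.0, size=25)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_rejections += 1

print(f"observed Type I error rate: {false_rejections / n_trials:.3f} (target {alpha})")
```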
If we want the test to pick up a significant effect, it means that whenever H1 is true, the test should
conclude that there is a significant effect.
In other words, whenever H0 is false, it should conclude that there is a significant effect.
Again, in other words, whenever H0 is false, it should reject H0. This probability is represented by
(1 - β) and, as seen in the table above, is defined as the power of the test.
Thus, if we want to increase the assurance that the test will pick up a significant effect, it is the power
of the test that needs to be increased.
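Power can be estimated the same way by simulation: draw data for which H1 really is true and count how often the test rejects H0. The effect size, sample size, and alpha below are arbitrary assumptions for the sketch.

```python
# Power (1 - beta): the fraction of tests that correctly reject H0
# when a real effect exists. Settings are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_trials, rejections = 0.05, 5000, 0

for _ in range(n_trials):
    # The two samples come from distributions whose means really differ,
    # so H1 is true in every trial.
    a = rng.normal(0.0, 1.0, size=25)
    b = rng.normal(0.6, 1.0, size=25)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(f"estimated power (1 - beta): {rejections / n_trials:.2f}")
```

In this sketch, increasing the sample size or the true difference between the groups pushes the estimated power closer to 1.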
Types of Errors
There are basically two types of errors:
• Type I
• Type II
Type I Error
A Type I error occurs when the researcher concludes that the relationship assumed in the
research hypothesis exists, when in reality it does not. In this type of error, the researcher is
supposed to reject the research hypothesis and accept the null hypothesis, but the opposite
happens. The probability that a researcher commits a Type I error is denoted by alpha (α).
Type II Error
A Type II error is just the opposite of a Type I error. It occurs when it is concluded that a
relationship does not exist, but in reality it does. In this type of error, the researcher is
supposed to accept the research hypothesis and reject the null hypothesis, but does not, and
the opposite happens. The probability that a Type II error is committed is represented by
beta (β).
Null Hypothesis (H0) vs. Alternative Hypothesis (H1)
• Observation: under H0, any results or effects observed are caused by chance; under H1, they are an
outcome of a real cause.
• Denoted by: H0 for the null hypothesis and H1 for the alternative hypothesis.
The null hypothesis is denoted by the symbol H0. It implies that there is no effect on the population
and that the dependent variable is not influenced by the independent variable in the study. According
to the null hypothesis, the result or effect is caused by chance and establishes no relation between
the two variables. The null hypothesis is generally based on a previous analysis or specialized
knowledge. The main types of null hypotheses are simple, composite, exact, and inexact hypotheses.
To justify the research hypothesis, or the argument put forward by the researcher, the null hypothesis
constructed against the alternative hypothesis must be proved wrong. The outcome can only go one of
two ways, rejection or acceptance, depending upon the experimental data and the nature of the scenario
taken for observation. The null hypothesis is accepted if the statistical test provides no satisfactory
evidence proving the anticipated effect on the population. Furthermore, incorrectly rejecting the null
hypothesis points to type I error (false positive conclusion), and incorrectly failing to reject the null
hypothesis results in type II error (false negative conclusion).
The symbol H1 or Ha denotes the alternative hypothesis. It implies an effect on the population: the
independent variable influences the dependent variable in the study. The alternative hypothesis can be
one-sided (directional) or two-sided (non-directional).
The alternative hypothesis asserts a statistically substantial relationship between the two variables. It
can be based on limited evidence or belief. From the researcher's perspective, this statement stands
correct and thus works to reject the contrasting null hypothesis and replace it with the new or improved
theory. The researcher predicts the distinguishing factors between the two variables, ensuring that
the data observed is not due to chance. For example, in testing whether a new survey incentive changes
response rates, H0 states that the incentive has no effect, while H1 states that it does.
In statistics, it is important to know if the result of an experiment is significant enough or not. In order
to measure the significance, there are some predefined tests which could be applied. These tests are
called the tests of significance or simply the significance tests.
This statistical testing is subject to some degree of error. For some experiments, the researcher is
required to define the probability of sampling error in advance. Sampling error exists in any test that
does not consider the entire population.
A test of significance is a formal procedure for comparing observed data with a claim.
• The claim is a statement about a parameter, like the population proportion p or the population
mean µ.
• The results of a significance test are expressed in terms of a probability that measures how
well the data and the claim agree.
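As a short sketch of testing a claim about a population proportion p, the snippet below runs a binomial test; the counts are hypothetical, and a reasonably recent SciPy (which provides binomtest) is assumed.

```python
# Testing a claim about a population proportion p.
# Claim (H0): p = 0.50. Observed: 62 successes out of 100 trials (hypothetical).
from scipy import stats

result = stats.binomtest(k=62, n=100, p=0.50)
print(f"p-value = {result.pvalue:.3f}")
# A small p-value means the observed data and the claim agree poorly,
# which is evidence against the claim.
```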
Technically speaking, statistical significance refers to the probability of a result of a statistical test or
study occurring by chance. The main purpose of performing statistical research is to find the truth. In
this process, the researcher has to ensure the quality of the sample, the accuracy of measurement, and
the use of good measures, all of which require a number of steps. The researcher has to determine
whether the findings of an experiment occurred because of a sound study or just by fluke.
Significance is a number representing the probability that the result of a study occurred purely by
chance. Statistical significance may be weak or strong, and it does not necessarily indicate practical
significance. Sometimes, when a researcher does not use language carefully in reporting an
experiment, the significance may be misinterpreted.
Psychologists and statisticians typically look for a probability of 5% or less that the results occurred
due to chance, which also means there is a 95% chance that the results did NOT occur by chance.
Whenever the result of an experiment is found to be statistically significant at this level, we can be
95% confident that the results are not due to chance.
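In code, applying the conventional 5% rule is a one-line comparison; the p-value below is an invented example rather than the output of any particular study.

```python
# Applying the conventional 5% significance level to a test result.
# The p-value of 0.031 is a made-up example.
alpha = 0.05
p_value = 0.031

if p_value < alpha:
    print("Statistically significant: unlikely to be due to chance alone.")
else:
    print("Not statistically significant: the result could plausibly be due to chance.")
```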