Final Notes: Advanced Research Methods
LMC/DMAS
ADVANCED
RESEARCH
METHODS AND
DATA ANALYSIS
INSTRUCTOR: DR IRFAN
Philosophical Assumption
1. What is philosophical theory and its types?
The term 'philosophical assumption' generally refers to a person's beliefs or thoughts about large
universal issues in life.
The philosophy of qualitative research is “interpretive, humanistic, and naturalistic” (Creswell, 2007). It
places significant importance on subjectivity.
Creswell provides five philosophical assumptions that lead to the choice behind qualitative
research: ontology, epistemology, axiology, rhetorical, and methodological assumptions.
These assumptions cover a range of questions that drive the researcher and influence the
research.
Ontological (The nature of reality): Relates to the nature of reality and its
characteristics. Researchers embrace the idea of multiple realities and report on these
multiple realities by exploring multiple forms of evidence from different individuals’
perspectives and experiences.
Epistemology (How researchers know what they know): Researchers try to get as close
as possible to participants being studied. Subjective evidence is assembled based on
individual views from research conducted in the field.
Axiological (The role of values in research): Researchers make their values known in
the study and actively report their values and biases as well as the value-laden nature of
information gathered from the field.
Methodology (The methods used in the process of research): inductive, emerging, and
shaped by the researcher’s experience in collecting and analyzing the data.
2. What is theory building?
The process of creating and developing a statement of concepts and their interrelationships to
show how and/or why a phenomenon occurs. Theory building leads to theory testing.
Sampling techniques
A sampling technique is the name or other identification of the specific process by which the
entities of the sample have been selected.
Probability sampling involves random selection, allowing you to make strong statistical
inferences about the whole group.
Non-probability sampling involves non-random selection based on convenience or other
criteria, allowing you to easily collect data.
Sample size
The number of individuals you should include in your sample depends on various factors, including
the size and variability of the population and your research design. There are different sample
size calculators and formulas depending on what you want to achieve with statistical analysis.
Sampling frame
The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it
should include the entire target population (and nobody who is not part of that population).
1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of being selected.
To conduct this type of sampling, you can use tools like random number generators or other
techniques that are based entirely on chance.
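As a quick illustration, the Python sketch below draws a simple random sample with the standard random module; the employee-ID sampling frame and sample size are invented for the example.

```python
import random

# Hypothetical sampling frame of 1,000 employee IDs (illustrative only)
sampling_frame = [f"EMP{i:04d}" for i in range(1000)]

# Draw a simple random sample of 100 without replacement;
# every member of the frame has an equal chance of selection.
random.seed(42)                       # for a reproducible draw
sample = random.sample(sampling_frame, k=100)
print(sample[:5])
```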
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to
conduct. Every member of the population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
If you use this technique, it is important to make sure that there is no hidden pattern in the list that
might skew the sample. For example, if the HR database groups employees by team, and team
members are listed in order of seniority, there is a risk that your interval might skip over people in
junior roles, resulting in a sample that is skewed towards senior employees.
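A minimal Python sketch of systematic sampling, again assuming a hypothetical employee-ID frame: a random start is chosen within the first interval, then every k-th member is taken.

```python
import random

sampling_frame = [f"EMP{i:04d}" for i in range(1000)]   # hypothetical list
sample_size = 100
interval = len(sampling_frame) // sample_size            # k = N / n = 10

# Pick a random start within the first interval, then take every k-th member.
random.seed(1)
start = random.randrange(interval)
sample = sampling_frame[start::interval]
print(len(sample), sample[:3])
```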
3. Stratified sampling
Stratified sampling involves dividing the population into sub populations that may differ in
important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is
properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based on
the relevant characteristic (e.g. gender, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be
sampled from each subgroup. Then you use random or systematic sampling to select a sample
from each subgroup.
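The sketch below shows proportionate stratified sampling with pandas; the population, the gender stratum, and the 10% sampling fraction are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical population with a 'gender' stratum (illustrative only)
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "id": range(1000),
    "gender": rng.choice(["F", "M"], size=1000, p=[0.6, 0.4]),
})

# Proportionate stratified sample: take 10% at random from each stratum.
sample = population.groupby("gender").sample(frac=0.10, random_state=0)
print(sample["gender"].value_counts())
```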
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup should
have similar characteristics to the whole sample. Instead of sampling individuals from each
subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If the
clusters themselves are large, you can also sample individuals from within each cluster using one
of the techniques above. This is called multistage sampling.
This method is good for dealing with large and dispersed populations, but there is more risk of
error in the sample, as there could be substantial differences between clusters. It’s difficult to
guarantee that the sampled clusters are really representative of the whole population.
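A small Python sketch of one-stage and multistage cluster sampling, assuming 20 hypothetical clusters (for example, branch offices) of 50 people each:

```python
import random

# Hypothetical population organised into 20 clusters of 50 people each
clusters = {f"office_{c}": [f"office_{c}_person_{i}" for i in range(50)]
            for c in range(20)}

# Stage 1: randomly select 5 whole clusters.
random.seed(7)
chosen = random.sample(list(clusters), k=5)

# One-stage cluster sample: every individual in each chosen cluster.
sample = [person for c in chosen for person in clusters[c]]

# Multistage variant: sub-sample 10 individuals within each chosen cluster.
multistage = [p for c in chosen for p in random.sample(clusters[c], k=10)]
print(len(sample), len(multistage))
```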
Non-probability sampling techniques are often used in exploratory and qualitative research. In
these types of research, the aim is not to test a hypothesis about a broad population, but to
develop an initial understanding of a small
or under-researched population.
1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible to the
researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample
is representative of the population, so it can’t produce generalizable results.
2. Voluntary response sampling
Similar to a convenience sample, a voluntary response sample is mainly based on ease of access:
instead of the researcher choosing participants, people volunteer themselves (for example, by
responding to a public online survey).
Voluntary response samples are always at least somewhat biased, as some people will inherently
be more likely to volunteer than others.
3. Purposive sampling
This type of sampling, also known as judgement sampling, involves the researcher using their
expertise to select a sample that is most useful to the purposes of the research.
It is often used in qualitative research, where the researcher wants to gain detailed knowledge
about a specific phenomenon rather than make statistical inferences, or where the population is
very small and specific. An effective purposive sample must have clear criteria and rationale for
inclusion. Always make sure to describe your inclusion and exclusion criteria.
4. Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other
participants. The number of people you have access to “snowballs” as you get in contact with
more people.
Levels of Measurement
There are four different scales of measurement, and any variable can be classified as belonging to
one of these four scales. The four types of scales are:
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Nominal Scale
A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or
“labels” to classify or identify the objects. A nominal scale usually deals with non-numeric
variables or with numbers that carry no quantitative value.
Characteristics of Nominal Scale
A nominal scale variable is classified into two or more categories. In this measurement
mechanism, the answer should fall into either of the classes.
It is qualitative. The numbers are used here to identify the objects.
The numbers don’t define the object characteristics. The only permissible aspect of
numbers in the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the variables are used as tags, and the answer to this question should be either M or F.
Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the ordering and ranking of data
without establishing the degree of variation between them. Ordinal represents the “order.” Ordinal
data is known as qualitative data or categorical data. It can be grouped, named and also ranked.
Characteristics of the Ordinal Scale
It shows the relative ranking of the variables, but the intervals between ranks are not known or equal.
Example:
Evaluating frequency of occurrence: Very often, Often, Not often, Not at all
Assessing degree of agreement: Totally agree, Agree, Neutral, Disagree, Totally disagree
Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative
measurement scale in which the difference between two values is meaningful and measured in
equal units, while the zero point is arbitrary (as with temperature in degrees Celsius), so ratios of
values are not meaningful.
Characteristics of Interval Scale:
The interval scale is quantitative as it can quantify the difference between the values
It allows calculating the mean and median of the variables
Differences between values can be compared directly by subtracting one value from another
It is widely used in statistics because it allows numerical values to be assigned to
assessments that have no true zero, such as attitudes, feelings, and calendar dates
Example:
Likert Scale
Net Promoter Score (NPS)
Bipolar Matrix Table
Ratio Scale
The ratio scale is the 4th level of measurement scale, which is quantitative. It allows researchers
to compare both differences and ratios between values, and its unique feature is a true zero point
(origin).
Characteristics of Ratio Scale:
It has a true (absolute) zero point, so negative values are not possible
All arithmetic operations (addition, subtraction, multiplication, and division) are meaningful
Mean, median, and mode can all be calculated
Examples include height, weight, age, and income
HYPOTHESIS TESTING
Hypothesis testing is a systematic procedure for deciding whether the results of a research study
support a particular theory which applies to a population. Hypothesis testing uses sample data to
evaluate a hypothesis about a population.
6 steps in hypothesis testing
Identify Population and Sample.
State the Hypotheses in terms of population parameters.
State Assumptions and Check Conditions.
Calculate the Test Statistic.
Calculate the P-value.
State the Conclusion.
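The sketch below walks these six steps for a one-sample t-test in Python (SciPy); the exam-score data and the null value of 70 are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of exam scores; H0: population mean = 70, H1: mean != 70.
rng = np.random.default_rng(3)
scores = rng.normal(loc=73, scale=8, size=40)       # step 1: sample of n = 40

# Steps 2-3: hypotheses stated above; the one-sample t-test assumes roughly
# normal scores (or a large enough n) and independent observations.
t_stat, p_value = stats.ttest_1samp(scores, popmean=70)   # steps 4-5

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:                                  # step 6: conclusion
    print("Reject H0: the population mean differs from 70.")
else:
    print("Fail to reject H0.")
```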
Types of hypotheses in research
The two types of hypotheses are null and alternative hypotheses. Null hypotheses are used to
test the claim that "there is no difference between two groups of data". Alternative hypotheses test
the claim that "there is a difference between two data groups".
MEDIATOR VARIABLE:
A mediator variable is the variable that accounts for (mediates) the relationship between the
independent and dependent variables. In other words, it explains how or why the independent
variable affects the dependent variable. Complete mediation occurs when the mediator fully
accounts for the effect of the independent variable on the dependent variable.
Mediation assumption:
Mediation analysis also makes all of the standard assumptions of the general linear model
(i.e., linearity, normality, homogeneity of error variance, and independence of errors). It is strongly
advised to check these assumptions before conducting a mediational analysis.
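As a rough illustration only, the sketch below runs the classic three-regression (Baron and Kenny style) mediation check in statsmodels on simulated data; in practice a dedicated tool such as the PROCESS macro or a bootstrap of the indirect effect would usually be preferred.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data where M partially mediates the X -> Y effect (illustrative).
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)              # path a
Y = 0.4 * M + 0.2 * X + rng.normal(size=n)    # paths b and c'
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# Three regressions: total effect c, path a, and paths c' and b together.
c  = sm.OLS(df["Y"], sm.add_constant(df[["X"]])).fit()
a  = sm.OLS(df["M"], sm.add_constant(df[["X"]])).fit()
bc = sm.OLS(df["Y"], sm.add_constant(df[["X", "M"]])).fit()

indirect = a.params["X"] * bc.params["M"]     # simple point estimate of a*b
print("total c:", round(c.params["X"], 3),
      "direct c':", round(bc.params["X"], 3),
      "indirect a*b:", round(indirect, 3))
```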
MODERATION
The term moderating variable refers to a variable that can strengthen, diminish, negate, or
otherwise alter the association between independent and dependent variables. Moderating
variables can also change the direction of this relationship.
Moderation Assumptions
In the simplest case, the moderator is a nominal variable with at least two groups (moderators can
also be continuous). The variables of interest (the dependent variable and the independent and
moderator variables) should have a linear relationship, which you can check with a scatterplot.
Assumptions of the moderation model include OLS regression assumptions, as described earlier,
and homogeneity of error variance. The latter assumption requires that the residual variance in the
outcome that remains after predicting Y from X is equivalent across values of the moderator
variable.
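A minimal sketch of testing moderation as an interaction term in an OLS regression (statsmodels), using simulated data with a two-group moderator W; the variable names and effect sizes are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data where the X -> Y slope depends on the moderator W (illustrative).
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=n)
W = rng.integers(0, 2, size=n)                 # two-group moderator (0/1)
Y = 0.3 * X + 0.5 * W + 0.6 * X * W + rng.normal(size=n)
df = pd.DataFrame({"X": X, "W": W, "Y": Y})

# Moderation is tested by the significance of the X:W interaction term.
model = smf.ols("Y ~ X * W", data=df).fit()
print(model.summary().tables[1])
```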
Conditional mediation:
Conditional mediation (CoMe) analysis integrates mediation and moderation analyses to examine
and test hypotheses about how mediated relationships vary as a function of context, boundaries,
or individual differences.
Moderated Mediation:
Moderated mediation analysis is a valuable technique for assessing whether an indirect effect is
conditional on values of a moderating variable. It integrates moderation and mediation into a
combined model within a regression framework.
Mediation analysis:
Mediation analysis is a way of statistically testing whether a variable is a mediator using linear
regression analyses or ANOVAs. In full mediation, a mediator fully explains the relationship
between the independent and dependent variable: without the mediator in the model, there is no
relationship.
What is the difference between mediated moderation and moderated mediation?
If there is a moderation as a whole and the main question is about the process generating this
moderation effect, then in this sense there is a mediated moderation. If, on the other hand, it is a
matter of examining an indirect effect with regard to possible moderators, then it is moderated
mediation.
Stats tool package:
The Stats Tools Package is a collection of tools developed or adapted to make statistical analysis
less painful.
DATA ANALYSIS
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal
of discovering useful information for decision-making by users. Data is collected and analyzed to
answer questions, test hypotheses, or disprove theories.
Data requirements
The data needed as inputs to the analysis are specified based upon the requirements of
those directing the analytics (or the customers who will use the finished product of the analysis). The
general type of entity upon which the data will be collected is referred to as an experimental
unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age
and income) may be specified and obtained. Data may be numerical or categorical (i.e., a text
label for numbers).
Data collection
Data is collected from a variety of sources. The requirements may be communicated by analysts
to custodians of the data, such as Information Technology personnel within an organization. The
data may also be collected from sensors in the environment, including traffic cameras, satellites,
recording devices, etc. It may also be obtained through interviews, downloads from online sources,
or reading documentation.
Data processing
The phases of the intelligence cycle used to convert raw information into actionable intelligence or
knowledge are conceptually similar to the phases in data analysis.
Data, when initially obtained, must be processed or organized for analysis. For instance, this may
involve placing data into rows and columns in a table format (known as structured data) for further
analysis, often through the use of spreadsheet or statistical software.
Data cleaning
Once processed and organized, the data may be incomplete, contain duplicates, or contain
errors. The need for data cleaning arises from problems in the way that the data are entered
and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks
include record matching, identifying inaccuracies, assessing the overall quality of existing data,
deduplication, and column segmentation. Such data problems can also be identified through a
variety of analytical techniques.
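For illustration, a short pandas sketch of common cleaning tasks (deduplication, fixing inconsistent labels, handling missing and impossible values) on a made-up survey extract:

```python
import numpy as np
import pandas as pd

# Hypothetical raw survey extract with typical problems (illustrative only).
raw = pd.DataFrame({
    "id":   [1, 2, 2, 3, 4],
    "age":  [25, 34, 34, np.nan, 230],          # missing value and an impossible age
    "city": ["Lahore", "lahore ", "lahore ", "Karachi", "Karachi"],
})

clean = (
    raw.drop_duplicates(subset="id")                                  # deduplication
       .assign(city=lambda d: d["city"].str.strip().str.title())      # fix labels
)
clean.loc[clean["age"] > 120, "age"] = np.nan        # flag impossible values
clean["age"] = clean["age"].fillna(clean["age"].median())  # simple imputation
print(clean)
```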
Exploratory data analysis
Once the data sets are cleaned, they can then be analyzed. Analysts may apply a variety of
techniques, referred to as exploratory data analysis, to begin understanding the messages
contained within the obtained data. The process of data exploration may result in additional data
cleaning or additional requests for data, so these phases are often iterative. Descriptive statistics,
such as the average or median, can be generated to aid in understanding the data. Data
visualization is also used, allowing the analyst to examine the data in a graphical format and obtain
additional insights regarding the messages within the data.
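A brief pandas/matplotlib sketch of these exploratory steps, using a tiny made-up data frame:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny invented data frame standing in for a cleaned dataset.
clean = pd.DataFrame({"age": [25, 34, 34, 41, 29], "income": [40, 55, 52, 80, 47]})

print(clean.describe())          # mean, median (50%), spread, min/max
print(clean.corr())              # quick look at pairwise relationships

clean.hist(figsize=(6, 3))       # simple visual check of each distribution
plt.tight_layout()
plt.show()
```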
Modeling and algorithms
Mathematical formulas or models (known as algorithms) may be applied to the data in order to
identify relationships among the variables, for example, using correlation or causation. In
general terms, models may be developed to evaluate a specific variable based on other variable(s)
contained within the dataset, with some residual error depending on the implemented model's
accuracy (e.g., Data = Model + Error).
Inferential statistics includes techniques that measure the relationships between particular
variables. For example, regression analysis may be used to model whether a change in
advertising (independent variable X) provides an explanation for the variation in sales (dependent
variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as
(Y = aX + b + error), where the model is designed such that a and b minimize the error when
the model predicts Y for a given range of values of X. Analysts may also attempt to build models
that are descriptive of the data, in an aim to simplify analysis and communicate results.
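As a concrete example of Y = aX + b + error, the sketch below fits a least-squares line to invented advertising and sales figures with NumPy:

```python
import numpy as np

# Hypothetical advertising spend (X) and sales (Y); fit Y = aX + b by least squares.
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)       # advertising (e.g. in $000s)
Y = np.array([4.1, 6.2, 7.8, 10.1, 11.9, 14.2])     # sales

a, b = np.polyfit(X, Y, deg=1)       # slope a and intercept b that minimise error
residuals = Y - (a * X + b)          # Data = Model + Error

print(f"Y = {a:.2f}*X + {b:.2f}")
print("mean squared error:", round(np.mean(residuals**2), 3))
```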
Data product
A data product is a computer application that takes data inputs and generates outputs, feeding
them back into the environment. It may be based on a model or algorithm. For instance, an
application that analyzes data about customer purchase history, and uses the results to
recommend other purchases the customer might enjoy.
CORRELATIONAL AND ITS ASSUMPTION
Correlation is a statistical measure that expresses the extent to which two variables are linearly
related (meaning they change together at a constant rate). It's a common tool for describing simple
relationships without making a statement about cause and effect.
The assumptions are as follows: level of measurement, related pairs, absence of outliers, and
linearity. Level of measurement refers to the scale of each variable: for a Pearson correlation, each
variable should be continuous (measured on an interval or ratio scale).
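A minimal example of a Pearson correlation in Python (SciPy), using invented study-hours and exam-score pairs:

```python
from scipy import stats

# Two continuous, paired variables (hypothetical measurements).
hours_studied = [2, 4, 5, 7, 8, 10, 11]
exam_score    = [52, 58, 63, 70, 74, 83, 88]

r, p_value = stats.pearsonr(hours_studied, exam_score)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
# r close to +1 or -1 indicates a strong linear relationship; check a
# scatterplot and screen for outliers before trusting the coefficient.
```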
LINEAR REGRESSION
Linear regression is an analysis that assesses whether one or more predictor variables explain the
dependent (criterion) variable. The regression has five key assumptions: a linear relationship,
multivariate normality, no or little multicollinearity, no autocorrelation, and homoscedasticity.
At a minimum, three of these assumptions need to be met to be confident in the results:
linearity, normality, and homoscedasticity.
SIMPLE LINEAR REGRESSION
Simple Linear Regression: A linear regression model with one independent and one dependent
variable.
Assumptions
Assumption 1: Linear Relationship.
Assumption 2: Independence.
Assumption 3: Homoscedasticity.
Assumption 4: Normality
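The sketch below shows one informal way to check the normality and homoscedasticity assumptions after fitting a simple linear regression in statsmodels; the data are simulated and the checks are illustrative rather than exhaustive.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical predictor and outcome for a simple linear regression.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3 + 0.8 * x + rng.normal(scale=1.0, size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = model.resid, model.fittedvalues

# Normality of residuals (Shapiro-Wilk) and a crude equal-variance check:
print("Shapiro-Wilk p =", round(stats.shapiro(resid).pvalue, 3))
print("residual spread, low vs high fitted values:",
      round(resid[fitted < np.median(fitted)].std(), 2),
      round(resid[fitted >= np.median(fitted)].std(), 2))
# A residual-vs-fitted scatterplot is the usual visual check for linearity
# and homoscedasticity.
```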
MULTIPLE LINEAR REGRESSION
Multiple Linear Regression: A linear regression model with more than one independent variable
and one dependent variable.
Multiple linear regression analysis makes several key assumptions: There must be a linear
relationship between the outcome variable and the independent variables. Scatterplots can show
whether there is a linear or curvilinear relationship.
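A short multiple linear regression sketch with the statsmodels formula API; the salary, experience, and education figures are invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: predict salary from years of experience and education.
df = pd.DataFrame({
    "salary":     [30, 35, 42, 50, 58, 66, 75, 85],
    "experience": [1, 2, 4, 5, 7, 9, 11, 13],
    "education":  [12, 14, 14, 16, 16, 18, 18, 20],
})

# One dependent variable, two independent variables.
model = smf.ols("salary ~ experience + education", data=df).fit()
print(model.params)        # intercept and partial slopes
print(model.rsquared)      # proportion of variance explained
```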
ISSUES IN DATA ANALYSIS
1. Collecting Meaningful Data
With the high volume of data available for businesses, collecting meaningful data is a big
challenge. Typically, employees spend much of their time sifting through the data to gain insights,
which can be overwhelming.
2. Selecting the Right Analytics Tool
Without the perfect tool for your business data analytics needs, you may not be able to conduct
the data analysis efficiently and accurately.
Different analytics tools (Power BI, Tableau, RapidMiner, etc.) are available, and they offer varying
capabilities. Besides finding software that fits your budget, you should consider other factors such
as your business objectives and the solution’s scalability, integration capabilities and the ability to
analyze data from multiple sources, etc.
3. Data Visualization
Data needs to be presented in a format that is easy to understand. Usually, this is in the form
of graphs, charts, infographics, and other visuals. Unfortunately, doing this manually, especially
with extensive data, is tedious and impractical. For instance, analysts must first sift through the
data to collect meaningful insights, then plug the data into formulas and represent it in charts and
graphs.
The process can be time-consuming, not to mention that the data collected might not be all-
inclusive or up to date. With appropriate visualization tools, however, this becomes much easier,
more accurate, and more relevant for prompt decision making.
4. Data From Multiple Sources
Usually, data comes from multiple sources. For instance, your website, social media, email, etc.,
all collect data that you need to consolidate when doing the analysis. However, doing this
manually can be time-consuming. You might not be able to get comprehensive insights if the data
size is too large to be analyzed accurately.
Software built to collect data from multiple sources is pretty reliable. It gathers all the relevant data
for analysis, providing complete reports with minimal risk of errors.
5. Low-Quality Data
Inaccurate data is a major challenge in data analysis. Generally, manual data entry is prone to
errors, which distort reports and influence bad decisions. Also, manual system updates pose the
threat of errors, e.g., if you update one system and forget to make corresponding changes on the
other.
Fortunately, having the tools to automate the data collection process eliminates the risk of errors,
guaranteeing data integrity. Moreover, software that supports integrations with different solutions
helps enhance data quality by removing inconsistent data.
6. Data Analysis Skills Challenges
Another major challenge facing businesses is a shortage of professionals with the necessary
analytical skills. Without in-depth knowledge of interpreting different data sets, you may be limited
in the number of insights you can derive from your data.
In addition to hiring talent with data analysis skills, you should consider acquiring software that is
easy to use and understand. Alternatively, you could conduct training programs to equip your
employees with the most up-to-date data analysis skills, especially those handling data.
7. Scaling Challenges
With the rapidly increasing data volume, businesses are faced with the challenge of scaling data
analysis. Analyzing and creating meaningful reports becomes increasingly difficult as the data
piles up.
This can be challenging even with analytics software, especially if the solution is not scalable.
That’s why it’s important to consult before acquiring a tool to ensure it’s scalable and supports
efficient data analysis as your business grows.
8. Data Security
Data security is another challenge that increases as the volume of data stored increases. This
calls for businesses to step up their security measures to minimize the risks of potential attacks as
much as possible.
There are several ways of mitigating the risks, including controlling access rights, encrypting data
with secured login credentials, and conducting training on big data. Alternatively, you could hire
the services of cybersecurity professionals to help you monitor your systems.
9. Budget Limitations
Data analysis is a cost-intensive process. It can be a costly investment, from acquiring the right
tools to hiring skilled professionals and training the employees on the basics of data analysis.
Again, with the high volatility of data, the managers have to be proactive to secure the system and
address any security threats while scaling the system to accommodate the growing volume of data.
Often, risk management is a small business function, and getting budget approvals to implement
these strategies can be a challenge. Nonetheless, acquiring the necessary tools and expertise to
leverage data analysis is worth the investment, so managers have to be strategic about the
solutions they acquire and provide detailed return on investment (ROI) calculations to support the budget.
10. Lack of a Data Culture
The success of data analysis in a business depends on the culture. In a research paper on
business intelligence, 60% of companies claimed that company culture was their biggest obstacle.
Most companies, however, are not yet data-driven; they have not equipped their employees with
the necessary knowledge of data analysis.
To overcome this challenge, it’s crucial to equip your employees to support data culture by
providing the necessary training.
11. Data Inaccessibility
Data collected can only benefit the business if it’s accessible to the right people. From the analysts
to the decision-makers, businesses need to make sure every key person has the right to access
the data in real-time and be fully empowered with knowledge on how to analyze different data sets
and use the insights.
Mainly, businesses restrict system access for security reasons. But with appropriate security
safeguards, you can enable safe yet unrestricted data access for analysis and decision-making
purposes.
TERMS
1. Autocorrelation
Autocorrelation represents the degree of similarity between a given time series and a lagged
version of itself over successive time intervals. Autocorrelation measures the relationship between
a variable's current value and its past values.
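For example, the lag-1 autocorrelation of a (made-up) monthly sales series can be computed directly with pandas:

```python
import pandas as pd

# Hypothetical monthly sales series (illustrative only).
sales = pd.Series([100, 104, 109, 107, 112, 118, 121, 119, 125, 131])

# Lag-1 autocorrelation: relationship between each value and the previous one.
print("lag-1 autocorrelation:", round(sales.autocorr(lag=1), 3))
print("lag-2 autocorrelation:", round(sales.autocorr(lag=2), 3))
```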
2. Multicollinearity
Multicollinearity is a statistical concept where several independent variables in a model are
correlated. Two variables are considered to be perfectly collinear if their correlation coefficient is
+/- 1.0. Multicollinearity among independent variables will result in less reliable statistical
inferences.
3. Heteroskedasticity
In statistics, heteroskedasticity (or heteroscedasticity) happens when the standard deviations of a
predicted variable, monitored over different values of an independent variable or as related to prior
time periods, are non-constant.
4. Homoscedasticity
Homoscedasticity, or homogeneity of variances, is an assumption of equal or similar variances in
different groups being compared. This is an important assumption of parametric statistical tests
because they are sensitive to any dissimilarities. Uneven variances in samples result in biased
and skewed test results.
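As an illustration, the Breusch-Pagan test in statsmodels can flag heteroskedasticity; the data below are simulated so that the error spread grows with the predictor.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated regression whose error spread grows with x (heteroskedastic).
rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=200)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)     # non-constant error variance

exog = sm.add_constant(x)
model = sm.OLS(y, exog).fit()

# Breusch-Pagan test: a small p-value suggests heteroskedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, exog)
print("Breusch-Pagan p =", round(lm_pvalue, 4))
```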
5. Variance inflation factor
A variance inflation factor (VIF) is a measure of the amount of multicollinearity in regression
analysis. Multicollinearity exists when there is a correlation between multiple independent
variables in a multiple regression model. This can adversely affect the regression results.
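A short sketch of computing VIFs with statsmodels, using simulated predictors in which x1 and x2 are deliberately correlated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors where x1 and x2 are strongly correlated.
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)
x3 = rng.normal(size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity.
for i, name in enumerate(X.columns):
    if name != "const":             # VIF of the intercept is not of interest
        print(name, round(variance_inflation_factor(X.values, i), 2))
```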
6. Skewness and kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or
data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a
measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
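For example, SciPy reports skewness and (excess) kurtosis directly; the normal and exponential samples below are simulated for comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
symmetric = rng.normal(size=1000)           # roughly symmetric, normal tails
right_skewed = rng.exponential(size=1000)   # long right tail

for name, data in [("normal", symmetric), ("exponential", right_skewed)]:
    print(name,
          "skewness =", round(stats.skew(data), 2),
          "excess kurtosis =", round(stats.kurtosis(data), 2))
# Values near 0 for both suggest a roughly normal, symmetric distribution.
```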
7. Data normality
"Normal" data are data that are drawn (come from) a population that has a normal distribution.
This distribution is inarguably the most important and the most frequently used distribution in both
the theory and application of statistics. If X is a normal random variable with mean μ and standard
deviation σ, then the probability distribution of X is f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)).
A normality test is used to determine whether sample data has been drawn from a normally
distributed population (within some tolerance). A number of statistical tests, such as Student's
t-test and one-way and two-way ANOVA, require a normally distributed sample population.
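A minimal Shapiro-Wilk normality check in SciPy on a simulated sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=50, scale=5, size=80)   # hypothetical measurements

# Shapiro-Wilk: H0 = the sample was drawn from a normal population.
stat, p_value = stats.shapiro(sample)
print(f"W = {stat:.3f}, p = {p_value:.3f}")
# p > 0.05 gives no evidence against normality, so parametric tests such as
# the t-test or ANOVA are reasonable for this sample.
```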
8. Parametric test
Parametric tests assume a normal distribution of values, or a “bell-shaped curve.” For example,
height is roughly a normal distribution in that if you were to graph height from a group of people,
one would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution.