tutorial
Open access

Bayesian Hypothesis Testing Illustrated: An Introduction for Software Engineering Researchers

Published: 07 December 2022
Abstract

    Bayesian data analysis is gaining traction in many fields, including empirical studies in software engineering. Bayesian approaches provide many advantages over traditional, or frequentist, data analysis, but the mechanics often remain opaque to beginners due to the underlying computational complexity. Introductory articles, while successful in explaining the theory and principles, fail to provide a totally transparent operationalization. To address this gap, this tutorial provides a step-by-step illustration of Bayesian hypothesis testing in the context of software engineering research using a fully developed example and in comparison to the frequentist hypothesis testing approach. It shows how Bayesian analysis can help build evidence over time incrementally through a family of experiments. It also discusses chief advantages and disadvantages in an applied manner. A figshare package is provided for reproducing all calculations.

    1 Introduction

    Quantitative analysis of research data often involves statistical inference, typically in the form of hypothesis testing. Hypothesis testing is the process of deciding, from a sample of new observations, which among several competing hypotheses best explains the researched phenomenon. The inference is based on a statistical property of the studied population, computed on the collected sample. The property can be any statistical quantity, but often it is the observations’ mean value, their variance, or the proportion of observations that exhibit a studied effect.
    The classical approach to hypothesis testing relies on formulating two competing hypotheses about the population property of interest, whether it is the mean, the variance, a proportion, or another statistic. The first hypothesis is called the null hypothesis: It represents the status quo, the assumed level of the property in the studied population if the researched effect were absent. The competing hypothesis is called the alternative hypothesis, representing the opposite situation, where the researched effect is present in the population. Roughly speaking, using the sample at hand alone, we first estimate the property’s expected distribution for samples of the same size. We use this expected distribution to perform a test that tells us how unlikely it is for the collected sample of observations to have been drawn from the status-quo population representing the null hypothesis, the population that does not exhibit the researched effect. Based on this test and a chosen, acceptable margin of error, if it is not likely for the sample to have been drawn from the status-quo population, then we reject the null hypothesis and decide that the alternative hypothesis provides a better explanation for the nature of the collected sample of observations. If this is indeed the case, then we have discovered evidence in favor of the hypothesized effect. If it is not, then we have not discovered the hoped-for evidence, and we need to maintain the status quo represented by the null hypothesis. The reasoning process just described underlies what is known as the frequentist approach to statistical inference. The class of techniques using this reasoning process to evaluate two mutually exclusive, competing hypotheses based on a statistical property of a sample of observations is called the null hypothesis significance test (NHST) [17]. Deeper down, the NHST decision procedure relies on a quantity called the p-value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true; rejecting the null hypothesis at that level therefore carries that much risk of a Type-I error, mistakenly rejecting the null hypothesis when in fact it is true. The lower the p-value, the stronger the evidence in favor of the alternative hypothesis and thus the presence of the researched effect.
    The frequentist approach is what most researchers are familiar with, but it is not the only, the most accepted, or, in many cases, the best approach to quantitative data analysis. An arguably more viable option, one that uses more direct, less convoluted, and more symmetric reasoning to decide among competing explanations of an observed effect in research data, is the Bayesian approach.
    In recent years, the push to switch from classical, or frequentist, approaches to Bayesian approaches to analyze research data has dramatically increased. In 2016, the American Statistical Association released a cautionary statement regarding the pervasive use and abuse of p-values in reporting research results based on the frequentist approach [26]. In social and health sciences, notably in psychology and medicine, the push has been shifting attitudes considerably [18].
    Bayesian approaches have been on the radar of the software engineering research community since as early as 1999. Chulani’s doctoral work [4] applied Bayesian thinking to calibrating cost estimation models with prior expert knowledge. In 2010, Sridharan and Namin [20] delivered a tutorial at the International Conference on Software Engineering to advocate the use of Bayesian approaches for tasks ranging from classification to regression to hypothesis testing in software engineering research. Although the software engineering research community had not quite caught on during those early years, efforts promoting Bayesian approaches have since been intensifying, as exemplified by Robert Feldt’s provocative keynote at the 2019 Empirical Software Engineering Conference [7], an article by him and his colleagues Furia and Torkar detailing the virtues and methods of Bayesian statistics in the software engineering context [8], and their technical briefing delivered at the 2021 International Conference on Software Engineering [23].
    Bayesian analysis, while less prone to misuse, more direct in its reasoning process, and attractive in terms of its support for incremental research, has its own caveats. Chief among them is its computational complexity: Except in a few simple cases, there are no formulaic solutions, and the analysis relies on numerical methods for evaluating complicated integrals. These computations are performed using probabilistic sampling techniques, typically Markov Chain Monte Carlo simulation and its variations [25]. While many black-box examples can be found in articles discussing the benefits of Bayesian analysis, the examples are not exposed in a way that is totally transparent to the reader, and thus the inner dynamics of the analysis, codified in the tools used, remain somewhat magical. The discussion of the theory and equations helps, but it is hard to glean exactly how the theory is operationalized. The examples given remain insufficiently illustrative to allow readers who are new to the concept to understand what is inside the box. Online expositions of Bayesian statistics, including tutorials and presentations, also abound, but they suffer from the same caveats found in formally published articles. The literature, both formal and gray, lacks a self-contained archival piece that thoroughly reveals the inner workings of the Bayesian approach in the software engineering research context and achieves this using a running example that emphasizes one of the chief benefits of the Bayesian approach in this context: gradually building empirical evidence through a series of studies or replications [11].
    My purpose in this article is to demonstrate the Bayesian approach to statistical analysis using a fully developed example drawn from the software engineering context. The example addresses a plausible research question and is simple enough to allow complete manual analysis, but realistic enough to demonstrate both the principles and the process, including the mechanics of each step and the underlying mathematics, without relying on black-box tools and statistical packages. The process includes formulating hypotheses, representing prior knowledge, deriving a posterior distribution, calculating the main decision-making statistic involved (the Bayes factor), and accumulating evidence over time via replication studies. For comparison, I provide the corresponding frequentist approach at each step, and highlight the differences.
    The population property of interest in the article’s running example is the proportion rather than the more usual mean. Proportions are easier to understand and work with using simple mathematics. A few of the frequentist calculations for proportions require the standard normal and inverse normal distribution functions: These should be familiar to most readers, and readily available in spreadsheets and statistical software packages. When dealing with proportions and discrete prior models, Bayesian calculations require only introductory results from probability. A spreadsheet version of the example containing all the calculations given here is included in a figshare package for those who wish to check and reproduce those calculations [6].

    2 An Example

    2.1 Is C Faster than Python?

    Suppose we are trying to determine if C programs are faster than their Python counterparts. This is our research question. We know from past work that C programs are generally faster than their Java counterparts. The past work shows C is faster than Java between 40% and 80% of the time. We have valid reasons for believing that the C results could be even better relative to Python than they were relative to Java. This bit of information constitutes our prior knowledge about our research question. What we already know is not about Python vs. C, which would have been desirable, and it seems somewhat vague, incomplete, or informal. But it is all we have, and we can still use it.

    2.2 The Initial Experiment

    To answer our research question, we design a new first experiment to compare C and Python. We find a random, but sufficiently representative, set of 28 Python programs implementing well-known algorithms, and an expert re-implements each program in C. We end up with 28 Python programs and 28 C versions. This is a paired sample: Each C implementation has a corresponding Python implementation and vice versa. Here 28 is our sample size. We assume that the whole population of Python programs of interest is large enough, say, at least 10 times larger than the sample of 28 Python programs. This assumption, which we can very safely keep, allows us to use simpler math based on the binomial distribution in our calculations: It guarantees that sampling without replacement sufficiently approximates sampling with replacement [27].
    Suppose each program comes with standard input data for benchmarking purposes. We take the first pair and feed this input data to both the C and Python versions. We record which program finishes first. This is a single trial of the experiment. We do this for every pair, for a total of 28 times. Suppose that, at the end of the experiment, the data look like what is shown in Table 1. We record that 17 times out of 28, or 60.71% of the time, the C version was faster than the Python version.
    Table 1.
    Trial #: 1–28 (for each trial, whether the C version was faster)
    # of Trials (N): 28
    Frequency[C faster] (n): 17
    Proportion[C faster] (r): 60.71%
    Table 1. Data at the End of Initial Experiment
    What can we conclude from this result?

    2.3 Formulating the Hypotheses

    Let us formulate our hypotheses first. The null hypothesis often represents a presumed falsehood with which we wish to dispense by finding evidence against it and thus answering our research question favorably. In this case, we suspect that the falsehood is “C programs are in general no faster than their Python versions.” We need to express this sentence in terms of a statement about a proportion.
    Let n be the observed frequency of faster C programs in a sample of size N corresponding to N trials. Let the proportion, or rate, of faster C versions be r. That is,
    \begin{equation*} r = \frac{n}{N}. \end{equation*}
    Under the null hypothesis, r should not exceed 0.5, or 50%. Now we can formalize our \(H_0\) :
    \begin{equation*} H_0: r \le 0.5. \end{equation*}
    The alternative hypothesis simply captures the opposite belief, that “the C version is faster” on average. The opposite belief corresponds to a favorable answer to our research question, which we represent as
    \begin{equation*} H_1: r \gt 0.5. \end{equation*}
    We have good reason to believe C is on average faster than Python. The two languages could turn out to be fast equally often, but we do not expect a result in favor of Python, knowing what we know about both languages. Therefore, the hypotheses are directional.
    Here the constant 0.5 is the upper limit of the expected proportion under \(H_0\) , or the best-case value of r we should expect from the experiment if \(H_0\) were indeed true. In hypothesis testing, sometimes this comparison value is called the benchmark value. To differentiate it from the random variable denoting the observed proportion r, we write it as \(r_0\) , with the subscript indicating that it is associated with \(H_0\) :
    \begin{equation*} r_0 = 0.5. \end{equation*}

    3 The Frequentist Approach

    3.1 The General Process

    The frequentist approach relies on finding the right null-hypothesis significance test. The test is chosen depending on the population (and sample) property about which we are hypothesizing. We compute the proper test statistic relying on the property’s observed value from the experiment, the sample size, and the property’s best-case expected value under the null hypothesis. The test statistic is then compared to a critical value based on a chosen confidence level. The critical value determines the limits of the test statistic’s probability density function given the parameters of the experiment. If the test statistic falls outside these limits, then we reject the null hypothesis. Otherwise, we do not, or we would risk a Type-I error that exceeds the one allowed by the chosen confidence level, thus falsely rejecting \(H_0\) when it is likely to be true. The whole process relies on minimizing this error rate. If we reject the null hypothesis, then we have learned something: The sample is different from the hypothesized population under \(H_0\) in a statistically detectable manner. Our test is said to be statistically significant, and we have just generated support for the alternative hypothesis, which explains the detected difference. We can also compute other statistics, such as effect sizes and confidence intervals, to further qualify and strengthen the result. These additional statistics together help us interpret the result in the given context. If we cannot reject the null hypothesis, then we have no other choice but to retain the status quo: We have not learned much.

    3.2 Choosing the Right Test for the Example

    For the frequentist researcher, the problem is simple enough. The result “17 of 28 trials,” representing a rate of 60.71%, is a proportion. This is the property of the population about which we are hypothesizing. We can apply a continuous approximation of the binomial test, provided certain assumptions hold.

    3.3 Calculating the Test Statistic

    Proportions are easy to deal with. Much like a coin toss, we can think of each trial as an independent random binary variable with two possible outcomes, either the C version wins or the Python version wins.
    For the experiment to make sense, we would not select the same program in more than one trial, which makes the trials dependent. Fortunately, if the population is much larger than the number of trials (and we can safely assume that it is in this case), then we can treat the trials as independent even though they are not.
    If the probability of C winning is p, then the result, the number of trials in which C wins after performing N trials, has a binomial distribution with mean pN and variance \(Np(1 - p)\) . If N is large enough, then this distribution is approximately normal. The rule of thumb for large N is for both pN and \(N(1 - p)\) to be greater than or equal to 10. For our sample of 28 trials ( \(N = 28\) ), this rule of thumb is satisfied.
    If the frequency of some favorable outcome has a binomial distribution, then the rate of the favorable outcome—in this case, the number of times C wins divided by 28, the number of trials—has a proportion distribution with mean p and variance \(p(1 - p)/N\) . For large-enough N ( \(Np \ge 5\) and \(N(1-p) \ge 5\) ), the proportion distribution approaches the normal distribution, so we can assume it is normal.
    With this result, and substituting r for p, we have almost all we need for a null-hypothesis significance test suitable for proportions. First, let us standardize the proportion distribution to approximate it by the standard normal distribution and obtain a test statistic that we can conveniently use with the normal distribution. To standardize the distribution, we simply subtract \(r_0\) , the best-case expected value of the proportion under \(H_0\) , from the observed proportion r and divide the difference by the standard deviation of the best-case expected proportion under \(H_0\) . The computation yields the familiar z-score:1
    \begin{equation*} z = \frac{ r - r_0 }{\sqrt {r_0(1 - r_0)/N}}. \end{equation*}
    Plugging the observed and benchmark values into the above formula, we obtain the following z-score for our small experiment:
    \begin{equation*} z = \frac{0.6071 - 0.5}{\sqrt {0.5(1 - 0.5)/28}} = 1.1339. \end{equation*}

    3.4 Interpreting the Test Statistic

    All that remains is to compare the above z-score to a threshold value corresponding to a confidence level that we choose and use the comparison to decide whether we can reject \(H_0\) . Let us adopt 95% as the confidence level, which is what is usually done in software engineering experiments that are subject to variability due to multiple human factors.2
    From the normal distribution, the one-tailed threshold z-value corresponding to this level is 1.6449. This is called the critical value of the test statistic. We observe that it is above the computed test statistic, the z-score. Therefore, we cannot reject \(H_0\) based on the selected confidence level.
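    For readers who want to check these numbers programmatically rather than in the figshare spreadsheet, here is a minimal Python sketch (not part of the article's package; it assumes scipy is available) that reproduces the z-score and the one-tailed critical value:
        from math import sqrt
        from scipy.stats import norm

        N, n = 28, 17                       # number of trials, trials where C was faster
        r, r0 = n / N, 0.5                  # observed proportion and benchmark under H0
        conf_level = 0.95

        se0 = sqrt(r0 * (1 - r0) / N)       # standard error under H0, ~0.0945
        z = (r - r0) / se0                  # test statistic, ~1.1339
        z_crit = norm.ppf(conf_level)       # one-tailed critical value, ~1.6449
        print(z, z_crit, z > z_crit)        # z < z_crit, so H0 cannot be rejected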
    We just applied the binomial null hypothesis significance test to analyze this experiment in an attempt to answer our research question. Alas, we could not answer it very well: Instead, we ended up nearly where we started. Here is why. With the frequentist approach, when we are unable to reject \(H_0\), we unfortunately cannot simply conclude that “we accept \(H_0\)” and get it over with: At best, we can say “we retain \(H_0\).” The difference between “accepting” and “retaining” may feel subtle, but it is important in terms of statistical reasoning. The former implies a brand-new insight (even if it is in opposition to \(H_1\), what we wanted to show), while the latter implies the still-fuzzy status quo: what we wanted to know, but did not, at the moment of posing the research question. The resulting asymmetry is an unfortunate caveat of frequentist thinking.
    There is not much more we can do here, but, for completeness, Table 2 provides further ordinarily reported frequentist statistics. These include (a) the infamous p-value, the probability of obtaining a result at least as extreme as the one observed if \(H_0\) were true; (b) the effect size as measured by Odds Ratio (O.R.) for proportions;3 and (c) the lower confidence limit (C.L.) of the observed proportion (for the one-tailed test, the upper limit is the maximum possible proportion of 100%).
    Table 2.
    r (observed): 0.6071        N: 28
    \(r_0\) (expected): 0.5000        Normal? Yes
    Diff: 0.1071        Conf. Level: 0.95
    StdErr: 0.0945        z-critical: 1.6449
    z-score: 1.1339        p-val: 0.1284
    Lower C.L.: 0.4517        Reject \(H_0\)? No
    Odds under \(H_0\): 1.0000        Odds under \(H_1\): 1.5455        O.R.: 1.5455
    Table 2. Frequentist Analysis Results for Initial Experiment: One-Tailed Binomial Test
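    The remaining entries of Table 2 follow from the same quantities; a sketch, under the assumption (suggested by the table's value of 0.4517) that the lower confidence limit is formed with the standard error under \(H_0\):
        from math import sqrt
        from scipy.stats import norm

        N, n, r0, conf_level = 28, 17, 0.5, 0.95
        r = n / N
        se0 = sqrt(r0 * (1 - r0) / N)

        p_val = norm.sf((r - r0) / se0)               # one-tailed p-value, ~0.1284
        odds_ratio = (r / (1 - r)) / (r0 / (1 - r0))  # effect size (O.R.), ~1.5455
        lower_cl = r - norm.ppf(conf_level) * se0     # lower confidence limit, ~0.4517
        print(p_val, odds_ratio, lower_cl)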

    4 The Bayesian Approach

    The Bayesian approach follows a different logic. Instead of treating \(H_0\) and \(H_1\) asymmetrically and trying to refute \(H_0\) to show evidence in favor of \(H_1\), it attempts to build on prior knowledge, or beliefs, and to revise that knowledge in light of new data. With this thinking, there is no qualitative difference between which hypothesis is \(H_0\) and which is \(H_1\), and the asymmetry is broken. \(H_0\) and \(H_1\) are treated simply as competing theories, and one does not have the special status of representing something to be disproved. They do not even have to be mutually exclusive: One does not have to be the other’s opposite, as in the case of NHST. The “differential diagnosis” type of reasoning used by the frequentist approach (eliminate what is implausible to infer what is more plausible) is therefore avoided. There is also no immediate binary verdict, reject or cannot reject, based on an arbitrary threshold, the confidence level, representing our predetermined tolerance for error. Instead, we have a measure of the strength of evidence, the Bayes factor (BF), quantifying the favorability of one hypothesis over another [12]. Although this measure has interpretation guidelines similar to those for effect sizes in frequentist analysis (the odds ratio for proportions or Cohen’s d for the difference between means), it does not lead to a strict pass-fail decision in the same manner. Instead, it is open to interpretation in the research question’s particular context.
    Of course, the pass-fail conclusion in hypothesis testing with the frequentist approach is not an accidental bug, but a feature: It is a practical consideration for decision making. In practice, we cannot indefinitely remain on the fence about a research question whose answer is supposed to help make a practical decision, for example, whether to roll out a new vaccine to a population or a new software practice inside an organization. If evidence against the status quo that we are hoping to change is not strong enough (a failing hypothesis test), then we have no choice but to forgo the roll-out, at least until we have new information. In the short term, we may think we are still on the fence, but effectively the decision is automatically made for us by way of inaction in the long term: Doing nothing is also a decision. Bayesian analysis also faces this ultimate fate in practical applications. With Bayesian analysis, we are not necessarily avoiding a binary decision indefinitely (we may still have to decide whether to undertake an action or not depending on the results) but rather availing ourselves of tools that are more straightforward and easier to interpret when evaluating the evidence.

    4.1 Prior Knowledge

    One complication with the Bayesian approach is that we need a starting point: the prior knowledge. In the current example, the prior knowledge was captured by the following prose:
    We know from past work that C programs are generally faster than their Java counterparts. The past work shows C is faster than Java between 40% and 80% of the time. We have valid reasons for believing that the C results could be even better relative to Python than they were relative to Java.
    Again, this is not much: It is admittedly vague and incomplete, but it is a start. Let us formalize the above statement as a model.
    A model in Bayesian statistics is the probability distribution of a studied effect. The distribution represents the uncertainty of the current knowledge about that effect. Any parameters of the distribution can themselves be uncertain, subject to their own probability distributions. In our case, our modeling task is straightforward: We consider a range of values for prior observations of r, the observed proportion of faster C programs in a somewhat related context. The only information we have about that range comes from the statement quoted above. In the absence of further information, the simplest thing we can do is to assume the values in the given range to be equally likely.
    With this in mind, we treat r as a parameter of the prior knowledge. The prior knowledge is uncertain, represented by a range of possibilities: If we fix the value of r, then that uncertainty is gone. Further, the parameter r has a uniform distribution in the range \([0.4, 0.8]\). This is a continuous distribution. Dealing with a continuous distribution in Bayesian statistics requires numerical methods based on sampling the distribution: We would have to sample r many times over to complete the necessary calculations. So let us simplify the exposition, and the prior model, by pre-sampling r in the given range: Say we take five equally spaced values of r, each with a probability of 0.2. The result is the discrete distribution given in Figure 1. Even without the discretization, the choice of the distribution feels rather arbitrary, but it is a reasonable starting point given the lack of further information. We are not too worried, because as we gather more data, and more evidence, be it in favor of or against the alternative hypothesis, our starting point’s importance will diminish. You can think of the prior model as a representation of our initial bias, which will ultimately be overtaken by new evidence over time. Also, if we wish, we can later determine how sensitive our conclusions are to the prior knowledge’s exact nature by rerunning the analysis with other plausible prior models, and the sensitivity analysis may in fact show that, over time, the starting point does not matter much.
    Fig. 1.
    Fig. 1. The prior model for the initial experiment, represented by the distribution of r based on what we knew about the performance of C programs relative to another programming language.
    Given the prior model in Figure 1, we immediately know two things at the outset: the likelihood of \(H_0\) being true and the likelihood of \(H_1\) being true:
    \begin{equation*} P[H_0] = \int \limits _{r\ \le \ 0.5}^{} P[r] dr \hspace{56.9055pt} P[H_1] = \int \limits _{r\ \gt \ 0.5}^{} P[r] dr. \end{equation*}
    In our case, because r has a discrete distribution, the integrals reduce to simple summations. Also our competing hypotheses happen to be mutually exclusive. But generally, in the Bayesian approach, they do not need to be.
    The ratio of these quantities is called the prior odds:
    \begin{equation*} \mathrm{Prior\ Odds}_{10} = \frac{P[H_1]}{P[H_0]}. \end{equation*}
    The subscript is used to clarify which quantity is in the numerator and which in the denominator. Prior odds are a summary of the information regarding the relative strength of one hypothesis over the other based on the prior knowledge alone.
    In the example
    \begin{equation*} \mathrm{Prior\ Odds}_{10} = \frac{0.6}{0.4} = 1.5. \end{equation*}
    Thus, we start with the belief that \(H_1\) is 1.5 times more likely than \(H_0\) . But with the initial experiment, we now have new data, and new information: How does this new information change our initial belief?
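    In code, the prior model and the prior odds take just a few lines; the following sketch simply encodes the discrete distribution of Figure 1 (the variable names are illustrative):
        # Discrete prior of Figure 1: r in {0.4, ..., 0.8}, each value with probability 0.2
        prior = {0.4: 0.2, 0.5: 0.2, 0.6: 0.2, 0.7: 0.2, 0.8: 0.2}

        p_H0 = sum(p for r, p in prior.items() if r <= 0.5)   # P[H0] = 0.4
        p_H1 = sum(p for r, p in prior.items() if r > 0.5)    # P[H1] = 0.6
        prior_odds_10 = p_H1 / p_H0                           # 1.5
        print(round(p_H0, 4), round(p_H1, 4), round(prior_odds_10, 4))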

    4.2 Some Terminology

    Before proceeding further, it is worthwhile to review a few pairs of basic probability terms that matter in Bayesian reasoning. The differences between the terms in each pair may be subtle, but the nuances are sometimes important. Namely, I will discuss the terms probability vs. likelihood, marginal vs. conditional, and prior vs. posterior.
    Both probability and likelihood quantify uncertainty as a number on a scale of 0 to 1. We use the symbol \(P[.]\) for both, but this does not mean that they are interchangeable. The term probability is used to quantify the uncertainty of the possible outcomes of a phenomenon, such as an observation (a measurement taken from a population), the value of a random variable, or, to give more concrete examples, the result of a tennis tournament, or which of two randomly chosen programs runs faster. By definition, possible outcomes are mutually exclusive, and if we could enumerate all of them, then their probabilities would add up to 1: Outcomes give rise to a probability distribution. In contrast, the term likelihood is used to quantify the uncertainty of a hypothesis. Unlike competing outcomes, competing hypotheses can intersect, or one hypothesis can subsume another. It may not be easy, or even possible, to imagine all possible hypotheses of interest in a given context. When this is the case, the sum of hypothesis “probabilities” is not necessarily 1: These are not probabilities in the ordinary sense (relative to all possible mutually exclusive hypotheses), so a different term, likelihood, is used to make the distinction. Unlike outcomes, likelihoods do not give rise to probability distributions: They give rise to likelihood functions.
    In NHST, competing hypotheses are necessarily mutually exclusive: The alternative hypothesis is often simply the opposite of the null hypothesis, so the two may look like outcomes even though they really are not. In Bayesian reasoning, we can be much more flexible: We can have two hypotheses that overlap, or we can have more than two hypotheses, some of which overlap. For example, one hypothesis may be that “C programs are as fast as Python programs,” another may be “C programs are faster than Python programs,” and a third may be “C programs are at least 20% faster than Python programs.” Clearly, the likelihoods of these three hypotheses cannot add up to 1, and the second hypothesis subsumes the third.
    The terms marginal and conditional can apply to both probabilities and likelihoods. Implicit in both cases is a (possibly hidden) second factor, another event or some underlying parameter, that affects the quantified uncertainty. When we say marginal, we are quantifying the uncertainty irrespective of that second factor. When we say conditional, we are quantifying the uncertainty in the presence of that second factor when the factor assumes a specific value. If we know all possible outcomes of the event representing the second factor, or all possible values of the parameter representing the second factor, and their respective probabilities, then we could compute the marginal uncertainty (whether probability or likelihood) from the conditional uncertainties by integrating, or summing, over all such possibilities. When we compute a probability or likelihood in this way, we say that the uncertainty is “marginalized out,” and the computation process itself is called marginalization.
    The terms prior and posterior refer to a phenomenon’s uncertainty before and after, respectively, collecting new data on that phenomenon. These terms can also apply to probabilities or likelihoods (or probability distributions or likelihood functions). The prior uncertainty represents what we know about a phenomenon (or some property of a population) before we collect new data, or record new observations. The posterior uncertainty represents the change in the prior uncertainty in the presence of the new data: Thus, the posterior is a function of the prior and the newly recorded observations.
    With the above terms hopefully sufficiently clarified, we can turn our attention back to our running example.

    4.3 Bayes Factor

    With the Bayesian approach, we wish to know whether the new data we just observed in the first experiment is more likely under \(H_0\) or \(H_1\) . The ratio that gives us this information is known as the BF:
    \begin{equation*} \mathrm{BF}_{10} = \frac{P[\mathrm{Data}|H_1]}{P[\mathrm{Data}|H_0]}. \end{equation*}
    Again, the subscript tells us which hypothesis is in the numerator and which in the denominator of the ratio. Here, the numerator \(P[\mathrm{Data}|H_1]\) is the likelihood of the observed result under \(H_1\): If \(H_1\) were true, then what would be the probability of getting the observed result from the experiment? Likewise, the denominator \(P[\mathrm{Data}|H_0]\) is the likelihood of the observed result under \(H_0\). The term \(P[\mathrm{Data}|H_k]\) is known as the marginal likelihood of \(H_k\). We say “marginal” because the quantity is actually computed by integrating over all parameter values of r possible under the specified hypothesis. Although the quantity here is not in fact the uncertainty of the specified hypothesis, but of the new observations under that hypothesis, it is still more a statement about the hypothesis (\(H_0\) or \(H_1\)) than about the particular outcome (\(\mathrm{Data}\)): Note that the sum \(P[\mathrm{Data}|H_1] + P[\mathrm{Data}|H_0]\) does not equal 1.
    Thus, the Bayes factor tells us which hypothesis predicts the data better and to what extent. In other words, \(\mathrm{BF}_{10}\) answers the question: How much better did \(H_1\) predict the observed data over \(H_0\) ? If \(\mathrm{BF}_{10} \gt 1\) , then there is that much more evidence for \(H_1\) compared to \(H_0\) . If, however, \(\mathrm{BF}_{10} \lt 1\) , then there is that much more evidence for \(H_0\) compared to \(H_1\) . Note that the comparison is symmetric: \(H_0\) does not have a special status here. It just happens to represent one of the competing theories, or models, that are being evaluated. In addition, \(H_0\) and \(H_1\) need not be mutually exclusive in general, unlike in the frequentist approach.

    4.4 Interpretation Guidance for Bayes Factor

    Just as there is interpretation guidance for standardized effect sizes such as Cohen’s d and the odds ratio, we also have interpretation guidance for the Bayes factor. For example, for Cohen’s d, the magnitude of the effect is often considered small when \(d \lt 0.2\), medium when d is around 0.5, and large when \(d \gt 0.8\) [3]. For the Bayes factor, we have a similar classification, shown in Figure 2, but the interpretation is about evaluating the relative strength of evidence rather than the effect’s magnitude. The guidance is just that: It is neither definitive nor universal, and it should be adapted to the context as appropriate. Therefore, for practical purposes, it is arguably not too dissimilar to how we evaluate p-values, or statistical significance, relative to a chosen confidence level.
    Fig. 2.
    Fig. 2. Bayes factor interpretation guidance. The top half indicates evidence in favor of \(H_0\) and the bottom half in favor of \(H_1\) .

    5 Calculating Bayes Factor

    To calculate the Bayes factor, we need to evaluate both the numerator and denominator of the equation:
    \begin{equation*} \mathrm{BF}_{10} = \frac{P[\mathrm{Data}|H_1]}{P[\mathrm{Data}|H_0]}. \end{equation*}
    The calculation can be complex, since each hypothesis will need to be decomposed into all ways it can be true, that is, under each possible value of r that supports the hypothesis. This means evaluating \(P[\mathrm{Data}|H_k]\) for a specific r value and aggregating the probabilities resulting from each evaluation. If r has a continuous distribution, then the computation involves evaluating an integral over r; if r is discrete, then it is a simple summation. Note that we resort to marginalization here, the evaluation process explained in Section 4.2, where we marginalize out the parameter r to compute the likelihood we are looking for.
    Suppose an event E can happen under only two mutually exclusive conditions, \(C_1\) and \(C_2\), whose probabilities we know. The law of total probability tells us that we can decompose the calculation of \(P[E]\) as follows:
    \begin{equation*} P[E] = P[E|C_1]P[C_1] + P[E|C_2]P[C_2]. \end{equation*}
    If we can figure out each \(P[E|C_i]\), the conditional probability of E under condition \(C_i\), then we can calculate \(P[E]\). We can apply this law to the running example.

    5.1 Marginal Likelihoods for Initial Experiment

    To marginalize out r in \(P[\mathrm{Data}|H_k]\), we need to determine \(P[\mathrm{Data}|r,H_k]\), the likelihood of observing the data under \(H_k\) when r has a specific value. Then we can write the marginal likelihood of \(H_k\) as
    \begin{equation*} P[\mathrm{Data}|H_k] = \int \limits _{r\ \mathrm{possible\ under}\ H_k} {P[\mathrm{Data}|r,H_k]P[r|H_k]dr}. \end{equation*}
    Under \(H_0\), the possible values of r are those smaller than or equal to 0.5. Under \(H_1\), the possible values of r are those greater than 0.5. \(P[r|H_k]\) is the conditional probability of r having a specific value given that \(H_k\) is true. In the current example, both \(P[r = 0.4]\) and \(P[r = 0.5]\) equal 0.2 in the prior model, and these are the only two values possible under \(H_0\). So, we have the following:
    \begin{equation*} P[r = 0.4|H_0] = P[r = 0.5|H_0] = \frac{0.2}{0.2 + 0.2} = 0.5. \end{equation*}
    For \(H_1\) , r can assume three values, 0.6, 0.7, or 0.8, each, again, with the same probability 0.2. We have the following:
    \begin{equation*} P[r = 0.6|H_1] = P[r = 0.7|H_1] = P[r = 0.8|H_1] = \frac{0.2}{0.2 + 0.2 + 0.2} = \frac{1}{3} = 0.3333. \end{equation*}
    Table 3 summarizes the prior model for the initial experiment with all the conditional probabilities of r as computed above. These conditional probabilities under a specific hypothesis \(H_k\) constitute the marginal distribution of r under \(H_k\) .
    Table 3.
    r:                       40%      50%      60%      70%      80%
    \(P[r]\):                0.2000   0.2000   0.2000   0.2000   0.2000
    \(H_0\) or \(H_1\)?:     \(H_0\)  \(H_0\)  \(H_1\)  \(H_1\)  \(H_1\)
    \(P[r|H_0]\):            0.5000   0.5000   0.0000   0.0000   0.0000
    \(P[r|H_1]\):            0.0000   0.0000   0.3333   0.3333   0.3333
    \(P[H_0]\): 0.4000
    \(P[H_1]\): 0.6000
    \(\mathrm{Prior\ Odds}_{10}\): 1.5000
    Table 3. Prior Model for the Initial Experiment with Marginal Distributions and Prior Odds
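    The marginal distributions of r under each hypothesis in Table 3 are just the prior restricted to the r values admitted by that hypothesis and renormalized; a sketch:
        prior = {0.4: 0.2, 0.5: 0.2, 0.6: 0.2, 0.7: 0.2, 0.8: 0.2}
        p_H0 = sum(p for r, p in prior.items() if r <= 0.5)
        p_H1 = sum(p for r, p in prior.items() if r > 0.5)

        # P[r|H0] and P[r|H1]: restrict the prior to each hypothesis, then renormalize
        prior_given_H0 = {r: (p / p_H0 if r <= 0.5 else 0.0) for r, p in prior.items()}
        prior_given_H1 = {r: (p / p_H1 if r > 0.5 else 0.0) for r, p in prior.items()}
        print(prior_given_H0)   # {0.4: 0.5, 0.5: 0.5, 0.6: 0.0, 0.7: 0.0, 0.8: 0.0}
        print(prior_given_H1)   # {0.4: 0.0, 0.5: 0.0, 0.6: ~0.3333, 0.7: ~0.3333, 0.8: ~0.3333}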

    5.2 Conditional Likelihoods of an Observation under a Specific Proportion

    Next, we need to compute conditional likelihoods \(P[\mathrm{Data}|r,H_k]\) for each possible r. This is the likelihood of observing the result we obtained in the initial experiment under one of the hypotheses for a specific value of r possible under that hypothesis. It can easily be computed from the binomial distribution: If C was faster in n of N trials, then the probability of observing this outcome when the probability of C being faster equals r is given by
    \begin{equation*} P[\mathrm{Data = n/N}|r,H_k] = C(n, N)r^n(1-r)^{(N-n)}, \end{equation*}
    where C is the combination function, with \(C(n, N)\) being the number of ways we can select n items from a sample of N. Here \(N = 28\) and \(n = 17\). We need to compute this conditional quantity for all values of r, some of which are possible only under \(H_0\), others only under \(H_1\):
    \begin{equation*} \begin{split}P\left[\mathrm{Data = \frac{17}{28}}|r = 0.4,H_0\right] & = 0.0134 \\ P\left[\mathrm{Data = \frac{17}{28}}|r = 0.5,H_0\right] & = 0.0800 \\ P\left[\mathrm{Data = \frac{17}{28}}|r = 0.6,H_1\right] & = 0.1525 \\ P\left[\mathrm{Data = \frac{17}{28}}|r = 0.7,H_1\right] & = 0.0885 \\ P\left[\mathrm{Data = \frac{17}{28}}|r = 0.8,H_1\right] & = 0.0099. \\ \end{split} \end{equation*}
    Note that the sum of these “probabilities” does not equal 1, because \(P[\mathrm{Data} |r]\) , when interpreted as a function of r, does not yield a probability distribution. It is therefore technically a likelihood function with parameter r.
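    These conditional likelihoods are plain binomial probabilities, so they can be checked with one line per value of r; a sketch using scipy.stats.binom:
        from scipy.stats import binom

        N, n = 28, 17
        for r in (0.4, 0.5, 0.6, 0.7, 0.8):
            # P[Data = 17/28 | r]: probability of n "C faster" outcomes in N trials
            print(r, round(binom.pmf(n, N, r), 4))
        # prints approximately 0.0134, 0.0800, 0.1525, 0.0885, 0.0099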

    5.3 Evaluating Marginal Likelihood Integrals

    All that remains is to compute the integrals for the numerator and denominator of the Bayes factor. The integrals once again reduce to simple summations:
    \begin{equation*} \begin{split} P[\mathrm{Data}|H_0] & = \int _{r\ \le \ 0.5} P[\mathrm{Data}|r,H_0]P[r|H_0]dr \\ & = P\left[\mathrm{Data}=\frac{17}{28}|r=0.4,H_0\right]P[r=0.4|H_0] \\ & \ \ \ \ + P\left[\mathrm{Data}=\frac{17}{28}|r=0.5,H_0\right]P[r=0.5|H_0] \\ & = (0.0134)(0.5) + (0.0800)(0.5) = 0.0467. \end{split} \end{equation*}
    The sum for \(P[\mathrm{Data}|H_1]\) has just one extra term, because \(H_1\) admits three distinct r values:
    \begin{equation*} \begin{split} P[\mathrm{Data}|H_1] & = \int _{r\ \gt \ 0.5} P[\mathrm{Data}|r,H_1]P[r|H_1]dr \\ & = P\left[\mathrm{Data}=\frac{17}{28}|r=0.6,H_1\right]P[r=0.6|H_1] \\ & \ \ \ \ + P\left[\mathrm{Data}=\frac{17}{28}|r=0.7,H_1\right]P[r=0.7|H_1] \\ & \ \ \ \ + P\left[\mathrm{Data}=\frac{17}{28}|r=0.8,H_1\right]P[r=0.8|H_1] \\ & = (0.1525)(0.3333) + (0.0885)(0.3333) + (0.0099)(0.3333) \\ & = 0.0836. \end{split} \end{equation*}

    5.4 Final Step: Bayes Factor

    Finally, the Bayes factor is
    \begin{equation*} \mathrm{BF}_{10} = \frac{0.0836}{0.0467} = 1.7909. \end{equation*}
    The value of the Bayes factor supports \(H_1\) : The observed proportion is 1.7909 times more likely under \(H_1\) than it is under \(H_0\) . If we follow the proposed interpretation guidelines, then we may conclude the following: Even though the initial experiment provides more support for \(H_1\) than it does for \(H_0\) , the strength of the evidence is pretty weak. The odd Python program that runs faster than its C version is not rare enough. If we wanted to use the result to decide whether to rewrite all of our existing Python programs in C or not, then we would probably not deem such an effort worthwhile at this level of evidence.
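    The whole Bayes factor calculation of Sections 5.1 through 5.4 fits in a few lines of code; the sketch below (again assuming scipy, with illustrative variable names) reproduces the two marginal likelihoods and their ratio:
        from scipy.stats import binom

        N, n = 28, 17
        prior = {0.4: 0.2, 0.5: 0.2, 0.6: 0.2, 0.7: 0.2, 0.8: 0.2}
        p_H0 = sum(p for r, p in prior.items() if r <= 0.5)
        p_H1 = sum(p for r, p in prior.items() if r > 0.5)

        # Marginal likelihoods: sum of P[Data|r] * P[r|Hk] over the r values of each hypothesis
        ml_H0 = sum(binom.pmf(n, N, r) * (p / p_H0) for r, p in prior.items() if r <= 0.5)
        ml_H1 = sum(binom.pmf(n, N, r) * (p / p_H1) for r, p in prior.items() if r > 0.5)
        bf_10 = ml_H1 / ml_H0
        print(round(ml_H0, 4), round(ml_H1, 4), round(bf_10, 2))   # ~0.0467, ~0.0836, ~1.79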
    We can perform the calculation for all possible outcomes to see how the outcome affects the Bayes factor under the adopted prior model. Figures 3 and 4, respectively, give the marginal likelihoods and the corresponding Bayes factors for all possible outcomes.
    Fig. 3.
    Fig. 3. Marginal likelihoods, or evidence, for \(H_0\) and \(H_1\) for all possible outcomes of the initial experiment. The dark pair of bars in the middle represents the actual observation corresponding to C being faster in 17 of 28 trials.
    Fig. 4.
    Fig. 4. Bayes factor for all possible outcomes of the initial experiment (logarithmic scale). The darkest bar in the middle represents the actual observation corresponding to C being faster in 17 of 28 trials.

    6 Posterior Model

    If Bayesian hypothesis testing ended with the calculation of the Bayes factor, then it would not be that useful. The real benefit comes from the evidence under the new data being able to update the prior model, giving rise to a posterior model. The posterior model can then serve as a prior model for the next round of analysis when a new experiment or a replication of the old experiment is conducted with additional data. But how do we update the prior model to obtain the posterior model?

    6.1 Transforming Prior Distribution to Posterior Distribution

    More specifically, our goal is to convert the prior distribution of r to a posterior distribution, \(P[r|\mathrm{Data}]\) , after observing the new data. We want to compute \(P[r|\mathrm{Data}]\) from \(P[r]\) and a new outcome. Bayes’s theorem comes to the rescue:
    \begin{equation*} P[r|\mathrm{Data}] = P[r]\frac{P[\mathrm{Data}|r]}{P[\mathrm{Data}]}. \end{equation*}
    We already know \(P[r]\) for a specific r value. We compute the conditional likelihood \(P[\mathrm{Data}|r]\) (the probability of observing the new data if the population proportion were r) exactly as we computed \(P[\mathrm{Data}|r,H_k]\) using the marginal distribution of r under \(H_k\), knowing that for a given value of r, the outcome of the experiment (the number of times C was faster) has a binomial distribution. This is the same calculation we performed before, so we need not repeat it,
    \begin{equation*} P\left[\mathrm{Data} = \frac{n}{N}|r\right] = C(n, N)r^n(1-r)^{(N-n)}. \end{equation*}
    What remains is the calculation of \(P[\mathrm{Data}]\), which simply is the marginal probability of observing a specific outcome under any r, obtained by marginalizing out r over the whole prior:
    \begin{equation*} P[\mathrm{Data}] = \int _{r}{P[\mathrm{Data}|r]P[r]dr}. \end{equation*}
    Because r has a discrete distribution, the above integral again reduces to a sum in our case,
    \begin{equation*} P\left[\mathrm{Data} = \frac{17}{28}\right] = (0.2)(0.0134 + 0.0800 + 0.1525 + 0.0885 + 0.0099) \approx 0.0688. \end{equation*}
    With this result in hand, we have
    \begin{equation*} \begin{split} P\left[r = 0.4|\mathrm{Data} = \frac{17}{28}\right] = \frac{(0.2)(0.0134)}{0.0688} = 0.0389 \\ P\left[r = 0.5|\mathrm{Data} = \frac{17}{28}\right] = \frac{(0.2)(0.0800)}{0.0688} = 0.2324 \\ P\left[r = 0.6|\mathrm{Data} = \frac{17}{28}\right] = \frac{(0.2)(0.1525)}{0.0688} = 0.4429 \\ P\left[r = 0.7|\mathrm{Data} = \frac{17}{28}\right] = \frac{(0.2)(0.0885)}{0.0688} = 0.2571 \\ P\left[r = 0.8|\mathrm{Data} = \frac{17}{28}\right] = \frac{(0.2)(0.0099)}{0.0688} = 0.0288. \end{split} \end{equation*}
    You can check that the above probabilities this time add up to 1, and in fact, \(P[r|\mathrm{Data}]\) is a proper probability distribution when the observation is fixed. The resulting posterior model, or posterior distribution \(P[r|\mathrm{Data}]\), is shown in Figure 5. Note how the new data transformed the discrete prior model from a uniform distribution to one with a bell-shaped outline and a slight skew.
    Fig. 5.
    Fig. 5. Posterior model represented by the distribution of r, given an outcome of 17 C programs turning out faster in an experiment with 28 trials.
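    The posterior distribution of Figure 5 can likewise be reproduced with a few lines; a sketch that applies Bayes's theorem to every value of the discrete prior:
        from scipy.stats import binom

        N, n = 28, 17
        prior = {0.4: 0.2, 0.5: 0.2, 0.6: 0.2, 0.7: 0.2, 0.8: 0.2}

        # P[Data]: marginal probability of the observed outcome over the whole prior
        p_data = sum(binom.pmf(n, N, r) * p for r, p in prior.items())   # ~0.069

        # Bayes's theorem: P[r|Data] = P[r] * P[Data|r] / P[Data]
        posterior = {r: p * binom.pmf(n, N, r) / p_data for r, p in prior.items()}
        print({r: round(p, 3) for r, p in posterior.items()})
        # approximately {0.4: 0.039, 0.5: 0.232, 0.6: 0.443, 0.7: 0.257, 0.8: 0.029}; sums to 1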

    6.2 Answering Additional Questions Using the Posterior Distribution

    With the posterior model, we can also answer some additional interesting questions, which we are not normally able to answer using the frequentist approach.
    For example, given the result, what is the probability that r is between 0.5 and 0.7 (corresponding to a range that overlaps with both \(H_0\) and \(H_1\) )? (Answer: \(\sim\) \(93.2\%.\) )
    Or conversely, what is the narrowest range of plausible r values around the observed value of \(17/28\) with a credibility of at least \(90\%\)? (Answer: between 0.5 and 0.7, which carries about \(93.2\%\) of the posterior probability.) Such a range is referred to as a credible interval in Bayesian analysis. It is similar to a confidence interval in the frequentist approach, but with a more straightforward interpretation.
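    Both questions reduce to summing posterior probabilities; a sketch using the (rounded) posterior values of Figure 5 and a simple greedy construction of the credible set:
        posterior = {0.4: 0.0389, 0.5: 0.2324, 0.6: 0.4429, 0.7: 0.2571, 0.8: 0.0288}

        # Probability that r lies between 0.5 and 0.7
        print(sum(p for r, p in posterior.items() if 0.5 <= r <= 0.7))   # ~0.932

        # Smallest set of r values reaching the target credibility
        target = 0.90
        credible, mass = [], 0.0
        for r, p in sorted(posterior.items(), key=lambda kv: kv[1], reverse=True):
            credible.append(r)
            mass += p
            if mass >= target:
                break
        print(sorted(credible), round(mass, 4))   # [0.5, 0.6, 0.7] with ~0.932 of the probability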

    6.3 Posterior Odds

    Given the prior model, we knew the prior odds of \(H_1\) to \(H_0\) . It was 1.5. Now we can calculate the posterior odds, the odds after observing the new data, which we write as
    \begin{equation*} \mathrm{Posterior\ Odds}_{10} = \frac{P[H_1|\mathrm{Data}]}{P[H_0|\mathrm{Data}]}. \end{equation*}
    We can calculate the above quantity in two ways: (1) directly, as we did for the prior model, by using the derived posterior probabilities or (2) indirectly, by using Bayes’s theorem once again. Let us use method 2, because it is more revealing: It tells us how the Bayes factor can be used to update the prior odds to posterior odds,
    \begin{equation*} \frac{P[H_1|\mathrm{Data}]}{P[H_0|\mathrm{Data}]} = \frac{P[H_1]}{P[H_0]} \frac{P[\mathrm{Data}|H_1]}{P[\mathrm{Data}|H_0]}. \end{equation*}
    The term \(P[H_1]/P[H_0]\) is the prior odds that we calculated before. The last term is the Bayes factor. Thus, we have
    \begin{equation*} \mathrm{Posterior\ Odds} = \mathrm{Prior\ Odds} \times \mathrm{Bayes\ Factor}. \end{equation*}
    The above equation simply states that the Bayes factor converts prior odds to posterior odds: It is in essence an updating factor. For the initial experiment, \(\mathrm{BF}_{10}\) equals 1.7909. Therefore, we have
    \begin{equation*} \mathrm{Posterior\ Odds}_{10} = 1.5 \times 1.7909 = 2.6863. \end{equation*}
    We now have a new model, plus the odds have improved in favor of \(H_1\) after the initial experiment.
    You can check that if we calculate the posterior odds directly from the posterior distribution, then we would get the same result.
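    A quick sketch of that check, computing the posterior odds both directly from the posterior distribution of Figure 5 and via the Bayes factor:
        posterior = {0.4: 0.0389, 0.5: 0.2324, 0.6: 0.4429, 0.7: 0.2571, 0.8: 0.0288}

        # Method 1: directly from the posterior distribution
        p_H1_post = sum(p for r, p in posterior.items() if r > 0.5)
        p_H0_post = sum(p for r, p in posterior.items() if r <= 0.5)
        print(p_H1_post / p_H0_post)   # ~2.69

        # Method 2: prior odds times the Bayes factor
        print(1.5 * 1.7909)            # ~2.69, the same up to rounding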
    However, we do not rely much on the posterior odds to render a verdict, at least not as much as we rely on the Bayes factor. If the r values that support \(H_0\) had zero probabilities in the prior model (i.e., if \(P[r = 0.4]\) and \(P[r = 0.5]\) were both zero in the prior model), then the prior and posterior odds would remain infinite forever. If this is not the case but the evidence remains in favor of \(H_1\), even if only weakly, then the posterior odds would grow exponentially with each new experiment when a posterior model is used as the prior model for a subsequent experiment. Although the posterior odds do not deserve much attention because of this rapidly compounding effect, the incremental nature of the Bayesian approach underlying this behavior is really useful and worth exploring next.

    7 Building Evidence Through Replications

    Another major benefit of Bayesian analysis is its support for replicated studies. Replications are important for scientific advancement in any empirical field, and software engineering research is no exception [11].
    We have some initial, weak evidence supporting \(H_1\). Right now, we cannot find more suitable Python programs or expert programmer resources for implementing the C versions for comparison. But we are planning further studies or replications of the same experiment with additional samples of Python programs when they are available and when we have additional qualified programmers at our disposal to implement the C versions. Perhaps other researchers have similar plans to do the same. We can easily do this with the Bayesian approach and keep updating the prior knowledge. We know we can stop when the evidence stabilizes. Stability will happen over time if the new information produced by each replication is not too erratic and follows a general trend. With the Bayesian approach we have
    \begin{equation*} \mathrm{Evidence}_t = f(\mathrm{Evidence}_{t - 1}, \textrm {Knowledge}_t), \end{equation*}
    where \(\mathrm{Evidence}_t\) is the collective evidence available at time t. This evidence is inferred by updating the previous evidence, \(\mathrm{Evidence}_{t-1}\), from time \(t-1\) with \(\mathrm{Knowledge}_t\), the new knowledge just obtained. The function f is an incremental update function.
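    In the running example, the update function f is exactly the posterior calculation of Section 6: the posterior distribution of r from one study becomes the prior distribution for the next. A sketch of one such update step (the helper name update and the commented-out replication counts are illustrative), which can simply be called again for every future replication:
        from scipy.stats import binom

        def update(prior, N, n):
            """One Bayesian update: turn the current prior on r into a posterior."""
            evidence = sum(binom.pmf(n, N, r) * p for r, p in prior.items())
            return {r: p * binom.pmf(n, N, r) / evidence for r, p in prior.items()}

        # Evidence_t = f(Evidence_{t-1}, Knowledge_t): chain the studies as they arrive
        model = {0.4: 0.2, 0.5: 0.2, 0.6: 0.2, 0.7: 0.2, 0.8: 0.2}   # prior of Figure 1
        model = update(model, 28, 17)              # after the initial experiment
        # model = update(model, N_rep1, n_rep1)    # after replication 1, and so on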
    The frequentist approach does not have this incremental advantage, as evidence can only be combined through secondary studies: We have to either re-analyze all the available data by pooling the samples from the different replications into a mega-sample (which is often problematic, because statistical soundness can be easily compromised) or synthesize the results using meta-analysis techniques. In either case, the approach is non-incremental, since we would be aggregating different pieces of evidence, not updating existing packaged evidence (with all previous knowledge folded into it) with new knowledge in a piecemeal manner. With the frequentist approach, we thus have
    \begin{equation*} \mathrm{Evidence}_t = F\lbrace \mathrm{Knowledge}_k \mid k = 1 \ldots t\rbrace , \end{equation*}
    where F is an aggregation function. Typically, F weights each piece of knowledge according to a “quality” measure such as the sample size underlying the knowledge or the inverse of the sample’s variance. Also here, typically, we cannot express \(\mathrm{Evidence}_t\) recursively through a simple update function.

    7.1 Research Trajectory: More Datasets, More Samples

    Let us say that along the way we find additional, perhaps more suitable, datasets of Python programs. We plan subsequent replications of the same experiment with these new datasets as we discover them (we do not want to re-sample the datasets we have already used, to keep each new sample homogeneous within itself). We will do this opportunistically; however, it may get more and more difficult to find good samples, so our sample sizes may not be as big as we want them to be. We may not find all the new Python samples at once. Or we may not have the resources to re-implement the newly found Python samples in C all at once: So we may need to spread the effort over time. Whatever the reason, in the end, suppose we manage to conduct seven such replications with different samples of Python programs and with different sample sizes as more suitable Python programs are written and discovered. After the last replication, our research trajectory looks like the one in Table 4.
    Table 4.
    Study                 Acronym   # of trials (N)   Freq. C faster (n)
    Initial Experiment    Exp       28                17
    Replication 1         Rep1      25                19
    Replication 2         Rep2      12                 8
    Replication 3         Rep3      12                 9
    Replication 4         Rep4      11                 8
    Replication 5         Rep5      10                 7
    Replication 6         Rep6      10                 8
    Replication 7         Rep7       5                 4
    Table 4. Research Trajectory with Subsequent Replications, Sample Sizes, and Results
    Incidentally, the sample for the last replication is quite small, even though it may be a pretty good sample as far as representative Python programs go. This does not bode well for the frequentist approach, where small sample sizes are inherently problematic, but it is totally okay with the Bayesian approach. How do we proceed? I will show how using both approaches.

    7.2 Analyzing Replications with the Frequentist Approach

    If we were taking the frequentist approach, then we would simply reapply the same significance test we had applied in analyzing the initial experiment. We would have to apply the test to each replication independently. Suppose we have done this, and the results look like the data in Table 5. We include the initial experiment in the first row for comparison purposes. The table’s last column represents the uncertainty of the observed proportion, computed as the difference between the maximum observed r value of 80% and the lower confidence limit observed for a study. For replications 2–6, the rule of thumb for using the normal distribution in the hypothesis test is technically violated, but let us overlook this so that we can provide confidence intervals to assess how the observed proportion’s uncertainty evolves over time.4 However, for replication 7, we cannot easily overlook the tiny sample size, so we properly compute an exact p-value from the binomial distribution. Because the binomial distribution was used instead of the normal distribution, the confidence interval is not well defined for the last replication.
    Table 5.
    Study   # of trials (N)   Freq. C faster (n)   Prop. C faster (r = n/N)   Sig. (p-val)   Effect Size (O.R.)   Rej. \(H_0\)?   Param. L.C. Limit (L.C.L.)   Uncert. (Max. − L.C.L.)
    Exp1    28   17   60.71%   0.1284    1.5455   No    45.17%   34.83%
    Rep1    25   19   76.00%   0.0047    3.1667   Yes   59.55%   20.45%
    Rep2    12    8   66.67%   0.1241    2.0000   No    42.93%   37.07%
    Rep3    12    9   75.00%   0.0416    3.0000   Yes   51.26%   28.74%
    Rep4    11    8   72.73%   0.0658    2.6667   No    47.93%   32.07%
    Rep5    10    7   70.00%   0.1030    2.3333   No    43.99%   36.01%
    Rep6    10    8   80.00%   0.0289    4.0000   Yes   53.99%   26.01%
    Rep7     5    4   80.00%   0.1875*   4.0000   No    N/A      N/A
    * Exact p-value from the binomial distribution.
    Table 5. NHST Results of the Initial Experiment and the Seven Subsequent Replications
    According to Table 5, \(H_0\) was rejected in only three of the replications. The p-values were all over the place, as shown in Figure 6. The last replication’s sample size was too small, which forced us to conduct the hypothesis test using the binomial distribution and an exact p-value instead of the normal distribution and a z-score, thus preventing us from calculating a lower confidence limit. Moreover, this tweak in the test made the last result technically incomparable to the previous replications (again, let us ignore this technicality for now). But what is really most worrisome here is that the uncertainty of the observed proportion r, as measured by the difference between the maximum observed value in the sequence and the lower confidence limit (L.C.L.) for a specific experiment (last column of Table 5), did not gradually narrow with further replications and never stabilized (Figure 7). Remarkably, the magnitude of the effect, the O.R., improved in later experiments, but the occasionally high and erratic p-values cast doubt on the reliability of these results, suggesting some of the observed outcomes could have simply been due to chance.
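    For reference, the per-study p-values of Table 5 can be reproduced with the same test used in Section 3, plus an exact binomial tail probability for the smallest sample; a sketch (the (N, n) pairs are taken from Table 5):
        from math import sqrt
        from scipy.stats import norm, binom

        studies = {"Exp1": (28, 17), "Rep1": (25, 19), "Rep2": (12, 8), "Rep3": (12, 9),
                   "Rep4": (11, 8), "Rep5": (10, 7), "Rep6": (10, 8), "Rep7": (5, 4)}
        r0 = 0.5
        for name, (N, n) in studies.items():
            if N >= 10:
                # Normal approximation, as in Section 3 (rule of thumb stretched for the smaller samples)
                z = (n / N - r0) / sqrt(r0 * (1 - r0) / N)
                p = norm.sf(z)
            else:
                # Exact one-tailed binomial test: P[X >= n] when the true proportion is r0
                p = binom.sf(n - 1, N, r0)
            print(name, round(p, 4))   # 0.1284, 0.0047, 0.1241, 0.0416, 0.0658, 0.1030, 0.0289, 0.1875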
    Fig. 6.
    Fig. 6. Statistical significance over time from subsequent replications using NHST. The dotted line marks the chosen one-tailed significance level, which equals \(1-\mathrm{Confidence\ Level}\) .
    Fig. 7.
    Fig. 7. Uncertainty of the observed proportion r over time based on confidence limits.
    The core problem is that none of the replications could build on previous results: They had to be independent. Since the individual results came from different datasets, and the samples were possibly collected under different conditions, we cannot simply form a mega-sample by taking the union of the trials from all studies and then re-analyze the results using this mega-sample. If we could do that, then our analysis would be a straightforward pooling study, treating all Python programs written and collected at different times (ignoring that the way Python programs are written may have been evolving over time) as well as their C versions (possibly implemented by different expert programmers with different skills) as a quasi-random sample of the same large population. Alas, this naive approach to synthesizing knowledge is not considered statistically sound except in rare situations where not only are study designs meticulously repeated but population characteristics (objects and subjects studied), data collection methods, and sample selection criteria are also strictly controlled to make sure they remain constant over time and over the many replications. Our case does not fall under these rare situations, as we have no such guarantees. Simple pooling as a synthesis method is widely considered unsound in the overwhelming majority of software engineering studies, where many factors are difficult to control even within a single study. This situation in fact is not unique to software engineering: It applies in many other fields as well [1].
    However, we could conduct another type of secondary study, a meta-analysis [16] to combine the results at the end. A meta-analysis is essentially a weighted aggregation of effects from different studies to calculate a combined effect. Just like simple pooling, it is non-incremental. A fully satisfactory exposition of meta-analysis techniques is beyond this article’s scope, since it is a rich topic that can occupy a whole book. We provide an example here only for illustration.
A meta-analysis of the initial experiment and its seven replications is shown in Table 6, based on a fixed-effects model with the logarithm of the O.R. as effect size,5 as is commonly done when synthesizing results from studies that deal with proportions. With ratios such as the O.R., the logarithm is often taken to spread out the skewed distribution. The meta-analysis was conducted using the JASP software [21]. The fixed-effects model is the simplest meta-analysis model available, and we can use it because, in our case, we have no information on covariates or secondary factors that could influence the outcomes. The model weights each study by the amount of information it captures, so that a study with a large variation contributes less to the combined effect than a study with a small variation. The low p-value for the so-called omnibus test (the statistical significance test used for this kind of analysis) indicates that the combined effect is highly significant: There is a statistically detectable difference between the speed of C and Python programs. In the upper portion of the table, the high p-value for the heterogeneity test indicates that the effect size was sufficiently homogeneous across the studies and the use of the fixed-effects model was appropriate. The lower portion of the table shows the model’s only estimated coefficient (since we have no other factors), the intercept, along with a test of the estimate’s significance. The intercept is the meta-analytic estimate of the overall effect size as measured by the logarithm of the O.R., and the intercept’s z-test agrees with the omnibus test mentioned above. A 95% confidence interval is also given for the estimate.
    Table 6.
Table 6. Meta-Analysis of the Initial Experiment and Its Seven Replications Using the Logarithm of the O.R. as Effect Size
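To make the weighting scheme concrete, the following Python sketch performs a bare-bones inverse-variance, fixed-effects aggregation of the log O.R. values directly from the counts in Table 5. It only illustrates the kind of computation JASP carries out; it omits refinements such as continuity corrections, so its output should not be expected to reproduce Table 6 exactly.

    import math

    # (N trials, n trials with the C version faster) for Exp1 and Rep1-Rep7, from Table 5
    counts = [(28, 17), (25, 19), (12, 8), (12, 9), (11, 8), (10, 7), (10, 8), (5, 4)]

    weights, weighted_effects = [], []
    for N, n in counts:
        log_or = math.log(n / (N - n))            # log O.R. against the 50/50 benchmark
        se = math.sqrt(1.0 / n + 1.0 / (N - n))   # approximate standard error of the log O.R.
        w = 1.0 / se**2                           # inverse-variance weight
        weights.append(w)
        weighted_effects.append(w * log_or)

    combined = sum(weighted_effects) / sum(weights)   # combined (meta-analytic) log O.R.
    se_c = math.sqrt(1.0 / sum(weights))              # standard error of the combined effect
    print(f"combined log O.R. = {combined:.3f} (O.R. = {math.exp(combined):.2f})")
    print(f"95% CI for the log O.R. = [{combined - 1.96 * se_c:.3f}, {combined + 1.96 * se_c:.3f}]")
    print(f"z = {combined / se_c:.2f}")

Because each weight is the reciprocal of the variance of a study’s effect, studies with larger samples dominate the combined estimate, which is precisely why the small later replications contribute so little (a point we return to in Section 7.3).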
    A meta-analysis is usually accompanied by a forest plot, which is shown in Figure 8. The plot displays the means and confidence intervals of each study in the replication chain using a box-and-whisker plot, and the combined effect is indicated at the bottom with a separate box-and-whisker plot marking the overall effect’s estimate with the diamond. As a sanity check, we note that the diamond falls within the confidence interval of each study, confirming that the studies were homogeneous enough to warrant the use of a fixed-effects model.
    Fig. 8.
    Fig. 8. The forest plot of the meta-analysis for the initial experiment and its seven replications.
If we conducted an eighth replication, then we would have to repeat the meta-analysis all over again with the nine studies in the replication chain rather than treating the current meta-analysis as a packaged summary of the accumulated knowledge and simply updating it with the new replication’s results. In other words, the approach is not incremental.

    7.3 What Else Can We Do?

Although the meta-analysis seemingly gave us good information about the overall results, the nature of the meta-analysis implies that the information from the later studies with small sample sizes contributed very little to the combined effect. We could have discarded the smaller replications and lost only the small amount of information they contained. The conditions under which the replications were conducted, especially the deterioration of sample sizes, were admittedly not ideal (in fact, I reduced the sample sizes deliberately to illustrate this point). But what if, although it was getting harder and harder to find additional samples, our sample selection techniques improved over time with experience, in terms of the ability to identify programs to test that are increasingly more relevant for the kind of application we have in mind? Then we would not want to lose or undervalue the information contained in those small but high-quality samples.
Suppose that in subsequent replications, with new datasets and even in the face of dreadfully smaller samples, we could control some factors increasingly well. For example, we could automate the translation of a Python program to a C program to eliminate the effect of a human programmer. Or we could discover increasingly more relevant, if rare, sets of Python programs to sample. We would not want to give up our research endeavor in these circumstances just because a frequentist analysis is likely to prevent us from making much progress and to disappoint us despite the increasing quality of the samples.
    Therefore, there may be valid reasons to continue the research pursuit. We could obtain better insights—some insights—with the Bayesian approach, even if those insights turn out to be in support of the null hypothesis or converge to a weak consensus. We can make better decisions if we are increasingly better informed, compared to the “don’t know” style of conclusions prevalent in the frequentist approach. The added power comes from the Bayesian approach’s ability to handle incremental knowledge, building evidence gradually. Let us try it.

    7.4 Analyzing Replications Incrementally with the Bayesian Approach

    With Bayesian hypothesis testing, the first replication needs to build on its own prior knowledge. Where could this knowledge come from? Of course, it will come from the initial experiment. The C vs. Java knowledge we used in the initial experiment is now irrelevant: It has already been used in the initial experiment’s analysis and baked into its result, the posterior model. Hence, the initial experiment’s posterior model will serve as the first replication’s prior model. And the first replication’s posterior model will feed into the second replication’s analysis in the same way, and so on, as illustrated in Figure 9. The ability to build on previous results rather than starting from scratch is a huge advantage here.
    Fig. 9.
    Fig. 9. The incremental nature of the Bayesian approach. Subsequent studies build on previous ones.
    The results of this incremental process are shown in Table 7. The Bayes factors (BF \(_{10}\) ) show some fluctuations as new information is added, but after a few replications, they stay in the upper-weak and lower-moderate range of the interpretation scale (the lower plot in Figure 10). The important point to remember is that at each replication, the results contain information from all the previous replications and the (initial) prior, but the latter’s influence keeps diminishing. The influence of the prior eventually becomes relatively unimportant. Posterior odds keep rising exponentially with each favorable result, as usually expected (the upper plot in Figure 10).
    Fig. 10.
    Fig. 10. Evolution of the Bayes factors and posterior odds for the prior, initial experiment, and seven replications. The Bayes factor stabilizes reasonably, while the posterior odds increases exponentially.
    Table 7.
Study | # of trials (N) | C faster (n) | C faster prop. (\(r = n/N\)) | BF\(_{10}\) | Odds\(_{10}\) (Posterior Odds) | Param. Uncert. (Stdev of r)
Prior | N/A | N/A | N/A | N/A | 1.5 | 61.25%
Exp1 | 28 | 17 | 60.71% | 1.7909 | 2.6863 | 38.99%
Rep1 | 25 | 19 | 76.00% | 18.7150 | 50.275 | 27.82%
Rep2 | 12 | 8 | 66.67% | 1.8118 | 91.089 | 25.10%
Rep3 | 12 | 9 | 75.00% | 3.906 | 355.861 | 22.13%
Rep4 | 11 | 8 | 72.73% | 2.9604 | 1053.49 | 19.76%
Rep5 | 10 | 7 | 70.00% | 2.1815 | 2298.21 | 17.97%
Rep6 | 10 | 8 | 80.00% | 5.0382 | 11578.9 | 15.69%
Rep7 | 5 | 4 | 80.00% | 2.2754 | 26346.1 | 14.96%
    Table 7. Bayesian Analysis of the Initial Experiment and the Seven Subsequent Replications
The last column gives the standard deviation of the prior and posterior models, as a proxy for model uncertainty.
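To make the mechanics behind Table 7 fully transparent, the following Python sketch, again written only for illustration (the replication package [6] contains the article’s own version of these calculations), chains the updates: a discrete prior over the five candidate values of r, a binomial likelihood, and each posterior feeding into the next replication as its prior. Each study’s Bayes factor is the ratio of its posterior odds to its prior odds for \(H_1\) (r above 50%) against \(H_0\) (r at or below 50%).

    # Candidate values of r and the flat prior of the original analysis.
    r_values = [0.40, 0.50, 0.60, 0.70, 0.80]
    prior = [0.20, 0.20, 0.20, 0.20, 0.20]

    # (N trials, n with the C version faster) for Exp1 and Rep1-Rep7, as in Table 7.
    data = [(28, 17), (25, 19), (12, 8), (12, 9), (11, 8), (10, 7), (10, 8), (5, 4)]

    def odds_h1(probs):
        # Odds of H1 (r > 50%) versus H0 (r <= 50%) under a probability vector.
        p_h1 = sum(p for r, p in zip(r_values, probs) if r > 0.5)
        return p_h1 / (1.0 - p_h1)

    current = prior
    for N, n in data:
        likelihood = [r**n * (1.0 - r)**(N - n) for r in r_values]   # binomial kernel
        unnorm = [lik * p for lik, p in zip(likelihood, current)]
        posterior = [u / sum(unnorm) for u in unnorm]                # Bayes' rule
        bf10 = odds_h1(posterior) / odds_h1(current)                 # this study's Bayes factor
        print(f"N={N:2d} n={n:2d}  BF10={bf10:8.4f}  posterior odds={odds_h1(posterior):10.3f}")
        current = posterior                                          # posterior becomes the next prior

The printed BF\(_{10}\) and posterior-odds values track the corresponding columns of Table 7 up to rounding, and the loop makes the compounding explicit: the posterior odds after any replication equal the prior odds of 1.5 multiplied by all the Bayes factors accumulated up to that point.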
    If we examine the posterior distributions, or models, given in Figure 11, then we notice a converging behavior. The uniformly distributed prior model gradually converges to an increasingly peaked and symmetric bell shape. We can also notice this convergence from the standard deviations of the distributions, which rapidly fall, from above 60% in the original prior model to below 15% in the last replication’s posterior model (Figure 12). In fact, with multiple replications—as new information overrides old information—the influence of the original prior rapidly decreases, and the choice of the prior model becomes increasingly irrelevant.
    Fig. 11.
    Fig. 11. Evolution of the posterior distributions over multiple replications. The top left chart shows the prior distribution. The remaining charts give the posterior distributions of the initial experiment and the seven subsequent replications.
    Fig. 12.
Fig. 12. Parameter uncertainty over time as measured by the standard deviation of r from the prior and posterior distributions. Uncertainty decreases over time as more replications are performed.
It is relatively easy to show that the sensitivity of the results to the choice of the prior model is low after many replications: Try re-analyzing the experiments with a differently shaped prior distribution. For example, instead of the flat, uniform distribution of the original analysis, we could use a peaked distribution, such as a discrete version of the triangular or the PERT distribution [14] having the same range of values.6 So let us modify the prior model’s distribution using the same five sampled r values of 40%, 50%, 60%, 70%, and 80% by assigning the new probabilities 0.05, 0.15, 0.60, 0.15, and 0.05, respectively. Note that the mean of this modified prior model is still 60% (since it is totally symmetric) and equals the mean of a corresponding continuous PERT distribution with minimum, most-likely, and maximum values of 40%, 60%, and 80%, respectively:
    \begin{equation*} \mathrm{mean}_{\mathrm{Prior}} = \frac{(0.4 + 4(0.6) + 0.8)}{6} = 0.6 = 60\%. \end{equation*}
If we reapplied the Bayesian analysis with the same replications, then the results would change as shown in Table 8, and the posterior distributions would evolve as shown in Figure 13. Note that the posterior distribution after the last replication still has its probability mass largely concentrated on the 70% value (with probability 0.74 vs. 0.88 in the original analysis), the probabilities of the 40% and 50% values are still negligible, and the probabilities of the 60% and 80% values are more skewed toward the 60% level. The BF after the last replication is very close to that of the original analysis (2.15 vs. 2.28), and the final posterior odds are of the same order of magnitude (29,403 vs. 26,346). Therefore, after several replications, the conclusions do not change even with a radically different prior model as our starting point to represent the existing knowledge.
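In code, this sensitivity check amounts to re-running the update loop sketched earlier in this section with the peaked prior substituted for the flat one; nothing else changes.

    # Peaked, PERT-like prior over the same five candidate values of r.
    # Its odds of H1 are (0.60 + 0.15 + 0.05) / (0.05 + 0.15) = 4, matching the Prior row of Table 8.
    prior = [0.05, 0.15, 0.60, 0.15, 0.05]

Re-running the loop with this prior closely tracks the BF\(_{10}\) and posterior-odds columns of Table 8.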
    Fig. 13.
    Fig. 13. Evolution of the posterior distributions over multiple replications with a PERT-based prior model. The top left chart shows the new prior distribution.
    Table 8.
Study | # of trials (N) | C faster (n) | C faster prop. (\(r = n/N\)) | BF\(_{10}\) | Odds\(_{10}\) (Posterior Odds) | Param. Uncert. (Stdev of r)
Prior | N/A | N/A | N/A | N/A | 4 | 37.42%
Exp1 | 28 | 17 | 60.71% | 2.0768 | 8.3073 | 22.53%
Rep1 | 25 | 19 | 76.00% | 11.5243 | 95.735 | 23.22%
Rep2 | 12 | 8 | 66.67% | 1.8039 | 172.695 | 22.59%
Rep3 | 12 | 9 | 75.00% | 3.2797 | 566.392 | 23.45%
Rep4 | 11 | 8 | 72.73% | 2.6662 | 1510.10 | 23.14%
Rep5 | 10 | 7 | 70.00% | 2.0792 | 3139.76 | 22.50%
Rep6 | 10 | 8 | 80.00% | 4.3488 | 13654.3 | 20.25%
Rep7 | 5 | 4 | 80.00% | 2.1534 | 29403.1 | 18.76%
    Table 8. Bayesian Re-analysis of the Initial Experiment and the Seven Subsequent Replications using a PERT-based Prior

    8 Conclusions

    8.1 Further Reading, Tools, and Software Engineering Applications

    In this introductory article, we have barely scratched the surface on Bayesian data analysis through a simple hypothesis testing example involving proportions. Working with proportions is straightforward, but to answer harder research questions, we need more complicated study designs involving more complex properties, which in turn require more sophisticated analysis workflows and tools that can support them. Every frequentist data analysis technique has its Bayesian counterpart: Bayesian tools and techniques are diverse and plentiful.
To begin with, Donovan and Mickey’s book [5] is a good choice for understanding the theory. The book can be coupled with JASP [21], an open-source, user-friendly desktop tool with a supportive community. JASP has many resources for beginners, including tutorials, videos, workshops, and a guide for students [10].
For the reader wishing to get deeper into Bayesian analysis and learn about advanced concepts, analysis workflows for different kinds of research problems, parameter estimation, and sensitivity analysis, both McElreath’s [15] and Kruschke’s [13] texts provide excellent, detailed expositions with examples. Both books rely on R (https://www.r-project.org) packages and Stan (https://mc-stan.org/) as analysis tools. Kruschke also uses JAGS (https://mcmc-jags.sourceforge.io/), a Markov chain Monte Carlo sampler [25].
    In her doctoral dissertation, Chulani was one of the first to apply the Bayesian approach in the software engineering context [4], namely as a sounder way to calibrate the parameters of empirical cost estimation models in the COCOMO family based on prior, expert knowledge. A subsequent article by her and her collaborators [2] summarizes the main contribution of this work, where they take advantage of a well-known corollary of Bayes’s theorem that describes how the statistical properties of a prior distribution can update the statistical properties of a sample distribution. For more recent applications of Bayesian approaches in software engineering research, Torkar, Furia, Feldt and colleagues offer several articles, including a main one on general principles in this specific context [8]. These authors’ other articles address different, more nuanced topics, including applications to code quality and programming languages [9], increasing the practical significance of Bayesian analysis by contextual factors such as framing effects and risk preferences [24], and dealing with missing data and conducting sensitivity analysis and sanity checks [22].

    8.2 Takeaways

    Bayesian approaches are not a silver bullet but provide some distinct advantages over the traditional, frequentist approaches to hypothesis testing. Chief among them is the ability to accumulate evidence incrementally by building on previous results rather than replacing them or aggregating them through secondary studies. This advantage is particularly important in empirical software engineering research, given the increasing popularity of replication, experiment families, and collaborative evidence building [19].
Additionally, Bayesian hypothesis testing breaks the awkward asymmetry between the null and alternative hypotheses present in NHST. It avoids some of the interpretational caveats of frequentist approaches, in particular those concerning the misunderstanding and misuse of p-values, confidence intervals, and other statistics produced by NHST. Moreover, Bayesian analysis can liberate researchers from burdensome distributional assumptions that often require large samples and sometimes jeopardize the credibility of the results in parametric analysis. As sample sizes change or dwindle from one replication to another in a series of studies, distributional assumptions may get compromised, and the frequentist approach can force us to continuously switch from one test type to another, making the comparison among the different replications even more problematic than it already is.
On the downside, except in the simplest cases, Bayesian analysis does not lend itself to canned, well-known analytic solutions that are readily available and familiar to those with basic statistics and probability knowledge. Instead, Bayesian analysis must rely on complex numerical sampling techniques that are encapsulated in computationally intensive tools.
Another point of friction in Bayesian analysis is the necessity of priors: The concept of a prior, and the idea of defining models that represent existing knowledge about the construct being studied, can sometimes be puzzling. This caveat is alleviated by available guidelines for choosing models in the presence of different levels and qualities of prior knowledge, by sanity and sensitivity checks that assess the impact of model choices, and by the diminishing importance of the initial model choice in a sequence of studies that build evidence incrementally.

    Footnotes

    1
Note that in this example we are dealing with proportions: The property of interest is the proportion, hence the use of the binomial test, which ultimately reduces to a z-test, and we use the z-score as the test statistic behind it. In software engineering research, the property of interest is more commonly the mean, since we are often comparing the mean of a treatment group (e.g., the mean value of a productivity, performance, quality, or size measure) with a benchmark value or the mean of a control group. In these cases, the usual test statistic is the t-statistic, provided certain distributional assumptions hold for the population and the sample, and the hypothesis test employed is some version of the well-known t-test. Underlying the t-test is a t-distribution, which is similar to the normal distribution but flatter and dependent on the sample size. When the sample size is large, the t-distribution converges to the standard normal distribution and the t-statistic becomes indistinguishable from the z-score. The two frequentist approaches, the binomial test and all the variations of the t-test, work with the same logic apart from how the test statistic is computed.
    2
Some software engineering researchers argue that 95% confidence is a low standard. The prevalence of the 95% level comes from accepted norms in social sciences, psychology, and some clinical settings, where human factors can constitute important covariates, introducing excess variability that may be absent in the study of physical phenomena. In the physical sciences, the norm is usually more stringent, 99% or even higher. In software engineering research, the studied phenomena also often involve multiple human factors in addition to complex technical factors, or the artifacts measured are created by humans in the presence of such factors. These factors include programmer skill, learning effects, mood, motivation, hidden incentives and disincentives, and software technologies that may interact among themselves and with other factors in unpredictable ways. Therefore, the 95% confidence level is usually deemed sufficient.
    3
The O.R. is computed from the odds of the two hypotheses. In this case, the odds of \(H_0\) equals the best-case expected benchmark proportion of 0.5 for faster C programs divided by the same expected benchmark proportion of 0.5 for faster Python programs. The odds of \(H_1\) is similar but computed with observed values: the observed proportion of faster C programs divided by the observed proportion of faster Python programs. The O.R. is the ratio of the two odds.
    4
    Normally, we would not overlook the violation of the normality assumption and instead switch to a different test, the exact binomial test, based on the binomial distribution, and compute an exact p-value for the test. However, having to switch the type of the hypothesis test in subsequent replications depending on the sample size poses another problem in the frequentist approach: Comparison among replications analyzed with different methods is questionable. This problem does not exist in the Bayesian approach.
    5
    In meta-analysis, we are concerned with the observed effect size (a relative measure with \(H_0\) being the reference point) rather than the observed property’s absolute value. In this case, the effect size is the O.R.; for the computation of O.R. for proportions, see footnote 3 in Section 3.4.
    6
    In the replication package, you can do this by simply changing the probabilities of r in the prior model on tab “Exp 5-Pt,” which represents the initial experiment with the five-point prior distribution.

    References

    [1]
    D. M. Bravata and I. Olkin. 2001. Simple pooling versus combining in meta-analysis. Eval. Health Profess. 24, 2 (2001), 218–230.
    [2]
    S. Chulani. 2001. Bayesian analysis of software cost and quality models. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM’01). 565–568.
    [3]
J. Cohen. 1992. A power primer. Psychol. Bull. 112, 1 (July 1992), 155–159.
    [4]
    S. Devnani-Chulani. 1999. Bayesian Analysis of Software Cost and Quality Models. Ph.D. Dissertation. University of Southern California.
    [5]
    T. M. Donovan and R. M. Mickey. 2019. Bayesian Statistics for Beginners: A Step-by-Step Approach. Oxford University Press, Oxford, UK.
    [6]
    H. Erdogmus. 2021. Bayesian Hypothesis Testing vs NHST. Dataset. Retrieved from https://figshare.com/articles/dataset/Bayesian_Hypothesis_Testing_vs_NHST/14109770.
    [7]
    ESEM 2019. 2019. Retrieved February 20, 2021 from http://eseiw2019.com/esem/.
    [8]
    C. A. Furia, R. Feldt, and R. Torkar. 2021. Bayesian data analysis in empirical software engineering research. IEEE Trans. Softw. Eng. 47, 9 (2021), 1786–1810.
    [9]
    C. A. Furia, R. Torkar, and R. Feldt. 2021. Applying Bayesian analysis guidelines to empirical software engineering data: The case of programming languages and code quality. arXiv:2101.12591 (2021). Retrieved from https://arxiv.org/abs/2101.12591.
    [10]
    M. A. Gross-Sampson, J. van Doorn, and E. J. Wagenmakers. 2019. Bayesian Inference in JASP: A Guide for Students. e-Book, OSF, Frankfurt, Germany.
    [11]
    N. Juristo and O. S. Gómez. 2012. Replication of software engineering experiments. In Empirical Software Engineering and Verification: International Summer Schools (LASER 2008-2010), B. Meyer and M. Nordio (Eds.). Springer, 60–88.
    [12]
R. E. Kass and A. E. Raftery. 1995. Bayes factors. J. Am. Stat. Assoc. 90, 430 (June 1995), 773–795.
    [13]
    J. Kruschke. 2015. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.). Academic Press/Elsevier, Amsterdam, Netherlands.
    [14]
    William J. McBride and Charles W. Mcclelland. 1967. PERT and the beta distribution. IEEE Trans. Eng. Manage. EM-14, 4 (1967), 166–169.
    [15]
    R. McElreath. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, Boca Raton, FL.
    [16]
    J. Miller. 2000. Applying meta-analytical procedures to software engineering experiments. J. Syst. Softw. 54, 1 (2000), 29–39.
    [17]
R. S. Nickerson. 2000. Null hypothesis significance testing: A review of an old and continuing controversy. Psychol. Methods 5, 2 (June 2000), 241–301.
    [18]
    A. Ortega and G. Navarrete. 2017. Bayesian hypothesis testing: An alternative to null hypothesis testing (NHST) in Psychology. In Bayesian Inference, J. P. Tejedor (Ed.). IntechOpen, London.
    [19]
    A. Santos, O. Gómez, and N. Juristo. 2020. Analyzing families of experiments in SE: A systematic mapping study. IEEE Trans. Softw. Eng. 46, 5 (2020), 566–583.
    [20]
    M. Sridharan and A. S. Namin. 2010. Bayesian methods for data analysis in software engineering. In Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering, Vol. 2. 477–478.
    [21]
    The JASP Team. 2021. JASP: A Fresh Way to Do Statistics. Retrieved Feb 20, 2021 from https://jasp-stats.org/.
    [22]
    R. Torkar, R. Feldt, and C. A. Furia. 2020. Bayesian data analysis in empirical software engineering: The case of missing data. In Contemporary Empirical Methods in Software Engineering, M. Felderer and G.H. Travassos (Eds.). Springer International Publishing, Cham, Switzerland, 289–324.
    [23]
    R. Torkar, C. A. Furia, and R. Feldt. 2021. Bayesian data analysis for software engineering. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE Computer Society, Los Alamitos, CA, 328–329.
    [24]
    R. Torkar, C. A. Furia, R. Feldt, F. Gomes de Oliveira Neto, L. Gren, P. Lenberg, and N. A. Ernst. 2021. A method to assess and argue for practical significance in software engineering. Unpublished (accepted to IEEE Trans. Softw. Eng., Jan 2021). 13 pages.
    [25]
    D. van Ravenzwaaij, P. Cassey, and S. D. A. Brown. 2018. A simple introduction to Markov chain monte-carlo sampling. Psychon. Bull. Rev. 25 (2018), 143–154.
    [26]
    R. L. Wasserstein and N. A. Lazar. 2016. The ASA statement on p-values: Context, process, and purpose. Am. Stat. 70, 2 (2016), 129–133.
    [27]
    J. Wroughton and T. Cole. 2013. Distinguishing between binomial, hypergeometric and negative binomial distributions. J. Stat. Educ. 21, 1 (2013), 16.
