Computer lab: Bayesian analysis

Methods of Scientific Working (3502-450)

Author

Prof. Karl Schmid, University of Hohenheim

Published

December 17, 2024

library(ggplot2)

Background

The purpose of this computer lab is to simulate an urn experiment to demonstrate the key principles of a Bayesian analysis.

Then, as an application of the theory we will apply Bayesian reasoning to estimate a parameter (allele frequency) in a simple population genetic model.

The third examples is a Bayesian test of a hypothesis given the data observed in a sample. We will test the hypothesis that that genotype frequencies of a genetic locus are in Hardy Weinberg equilibrium.

Marble experiment

We first start with a simple marble experiment in which we compare two hypotheses about the number of blue and white marbles in an urn.

Design of the experiment

The experiment is set up by flipping a fair coin to select one of two possible hypothesis as true hypothesis.

Depending on the outcome, we put the marbles into the urn:

If head, place in an urn 1 white and 3 blue marbles
If tail, place in an urn 3 white and 1 blue marble

From this, we generate two hypotheses, we will test:

H\(_B\): 1 white and 3 blue marbles (WBBB)
H\(_W\): 3 white and 1 blue marble (WWWB)

The goal of the Bayesian analysis is to test which of the hypotheses is probably true (or more likely true) given the data.

To find out, we conduct an experiment:

Mix the marbles in the urn
Draw a marble and record its color
Put it back into the urn
Repeat the procedure as necessary

Such an experiment can be easily set up with R

The goal of the setup is to find out which of the two hypotheses have a higher probability given the data. We flip the coin first to randomly choose between one of the hypotheses. Because we do not know the outcome of the coin toss, we assume that both hypotheses are equally likely and for this reason the prior probability is 1:1 for both hypotheses.

coin <- c("head", "tail")
draw <- sample(coin, 1)
if (draw == "head") {
  urn <- c('W', 'B', 'B', 'B')
} else {
  urn <- c('W', 'W', 'W', 'B')
}

We have now randomly constructed an urn, from which we can sample. To run our experiment, we need to calculate the posterior probability. Since we are comparing two hypotheses, we can use the ratio form of Bayes’s rule,

\[\frac{P(H_B | D)}{P(H_W | D)} = \frac{P(D | H_B)}{P(D | H_W)} \times \frac{P(H_B)}{P(H_W)}\]

The first term describes the ratio of posterior probabilities, the second the likelihood odds ratio, and the third the ratio of the prior probabilities.

It is now possible to construct the likelihoods. If hypothesis \(H_B\) is true, the probability of obtaining a blue draw is 0.75 (three out of four marbles are blue). Under hypothesis \(H_W\) it is 0.25. Therefore the likelihood odds ratio is

\[\frac{P(D|H_B)}{P(D|H_W)} = \frac{0.75}{0.25} = \frac{3}{1}.\]

If we draw a white marble, the likelihood odds ratio is

\[\frac{P(D|H_B)}{P(D|H_W)} = \frac{0.25}{0.75} = \frac{1}{3}.\]

For the first draw, the prior probability of both hypotheses is \(P(H_B) = P(H_W) = 0.5\), and we can calculate the posterior probabilities if we drew a blue marble as

\[\frac{P(H_B | D)}{P(H_W | D)} = \frac{0.75}{0.25} \times \frac{0.5}{0.5} = \frac{3}{1}\]

The posterior probability of the first draw is the prior probability of the second draw. Since we are only interested in the ratio of the new prior probabilities, we can just set

\[\frac{P(H_B)}{P(H_W)}=\frac{3}{1}\]

as the new prior ratio and start from new.

If the second draw is a white marble, the posterior ratio is then

\[\frac{P(H_B | D)}{P(H_W | D)} = \frac{0.25}{0.75} \times \frac{3}{1} = \frac{0.75}{0.75} = \frac{1}{1}\]

and so on. We can implement this experiment now in R:

    urnB <- c('W','B','B','B')
    urnW <- c('W','W','W','B')

    priorB <- 0.5 # prior probability of hypothesis B
    priorW <- 0.5 # prior probability of hypothesis W
    
    # randomly select the urn (i.e., the true hypothesis)
    hypothesis <- sample(c('H_B', 'H_W'), 1)
    if (hypothesis == 'H_B') { 
      urn <- c('W','B','B','B') 
    } else if (hypothesis == 'H_W')  {
      urn <- c('W','W','W','B') 
    }
    print(urn)

[1] "W" "W" "W" "B"

    stoppingrule <- 0.999999

    # set up experiment
    i <- 0
    draws <- c(i)
    results <- c("(Prior)")
    ratio <- c("1:1")
    posteriorB <- c(priorB)
    posteriorW <- c(priorW)
    
    while (priorB < stoppingrule && priorW < stoppingrule) {
      draw <- sample(urn, 1) # Randomly choose the true hypothesis
      if (draw == 'B') {
        likelihoodB <- 0.75
        likelihoodW <- 0.25
      } else {
        likelihoodB <- 0.25
        likelihoodW <- 0.75
      }
      
      # Bayes' theorem for odds ratios
      posteriorRatio <- (likelihoodB / likelihoodW) * (priorB / priorW)
      posteriorInteger <- paste(as.integer(posteriorRatio), ": 1")
      priorB <- posteriorRatio / (posteriorRatio + 1) #see footnote
      priorW <- 1 - priorB
      
      i = i + 1 # increase the counter for the next draw
      
      draws <- c(draws, i)
      results <- c(results, draw)
      ratio <- c(ratio, posteriorInteger)
      posteriorB <- c(posteriorB, priorB)
      posteriorW <- c(posteriorW, priorW)
    }

# put the data into a dataframe and plot the outcome
df = data.frame(draws, H_B = posteriorB, H_W = posteriorW)  

# Melting the dataframe to make it suitable for ggplot
df_long <- reshape2::melt(df, id.vars = 'draws', measure.vars = c('H_W', 'H_B'))

# Plotting
ggplot(df_long, aes(x = draws, y = value, color = variable)) +
  geom_line() +
  geom_point() +
  labs(title = paste("True hypothesis:", hypothesis, "; Stopping rule: Posterior probability >", stoppingrule), x = "Number of draws until stoping rule", y = "Posterior probability", color = "Hypothesis") +
  ylim(0, 1) +
  scale_color_manual(values = c("H_B" = "blue", "H_W" = "black"))

Excercises

Run a single simulation run using this code and inspect the output
Repeat the simulation run. How does the output change?
Change the prior to check out how much the experiment depends on the prior used in the early stage of the experiment
The ratio of the posterior probabilities give the relative believe in each of the hypotheses. Check how the ratio is effected when the true hypothesis differs from the one which are tested.

Study questions

Why can we use the posterior probability of the last experiment as the prior of the next experiment?
How would the calculations be changed if we have three different marble colors instead of two different marbles?