Statistical error model

Coverage and pairwise allele frequency comparisons

Code

# load dependencies
library(ggplot2)
library(tidyverse)
library(purrr)
library(cowplot)
set.seed(172452)

Simulate a pairwise comparison between alleles at a single location with coverage error. To do this, we use a random binomial distribution that takes in coverage and true allele frequency and outputs the estimated allele frequency.

Code

# how to simulate coverage
## rbinom(1, cvg, AFtrue) / cvg
## output is estimated AF

cvg <- 20
trueAF <- 0.43

estAF <- rbinom(1,cvg,trueAF) / cvg
estAF

[1] 0.45

Function that calculates estimated allele frequency when it is given coverage and true allele frequency:

Code

# function calculating estimated allele frequency
estAF <- function(cvg,trueAF){
  est <- rbinom(1,cvg,trueAF) / cvg
  return(est)
}

Now, we can scale our code and plot the results:

Code

# run this n times, scale up 
n <- 1000
tru <- 0.43
cvg <- sample(2:150,n,replace = TRUE)
af <- tibble(
  "cvg" = cvg,
  "trueAF" = rep(tru,times=n)
)

af$estAF <- af %>% pmap(estAF) %>% unlist()

# plot estimated allele frequency by coverage
plt <- af %>% ggplot(aes(x = estAF, y = cvg)) +
  geom_point(alpha = 0.15,colour="red") +
  geom_smooth(method = "gam", colour = "red",se = FALSE) +
  geom_vline(xintercept=tru, size = 0.8,linetype = "dashed") +
  labs(x = "Estimated allele frequency", y = "Coverage") +
  annotate("label",x = tru, y = -50, label = "True allele frequency") +
  theme_cowplot()
plt

Figure 1: Variability in allele frequency estimations due to coverage.