FDR Control

What is the False Discovery Rate (FDR)?

| Multiple Tests | Null true | Alternative true |
| --- | --- | --- |
| Not Reject | TN | FN |
| Reject | FP | TP |

$FDR = E[\frac{FP}{FP + TP}]$.

The $FP + TP$ rejected hypotheses are called "discoveries". The FDR is the expected proportion of false positives among all discoveries, with the ratio taken to be 0 when nothing is rejected.
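As a small illustration, the proportion of false discoveries in a single experiment can be computed directly from the counts in the table. The helper name `fdp` and the counts below are made up for this sketch; the FDR is the expectation of this quantity over repeated experiments.

```r
# False discovery proportion for one experiment.
# `fdp` is a hypothetical helper for illustration, not a function from any package.
fdp = function(fp, tp) {
  discoveries = fp + tp
  if (discoveries == 0) return(0)   # convention: 0 when nothing is rejected
  fp / discoveries
}

fdp(fp = 5, tp = 45)   # 5 false positives among 50 discoveries
fdp(fp = 0, tp = 0)    # no discoveries at all
```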

How to Control FDR (the B-H procedure)

Suppose a total of $m$ tests are conducted. Sort them by p-value from low to high and denote the ordered hypotheses by $H_1, ..., H_m$ and their p-values by $p_1 \leq ... \leq p_m$. Suppose the target FDR level is $\alpha$.

Find the largest $k$ with $p_k \leq \frac{k}{m} \alpha$ and reject all hypotheses $H_1, ..., H_k$ (if no such $k$ exists, reject nothing). For independent tests, this guarantees $FDR \leq \frac{m_0}{m} \alpha \leq \alpha$, where $m_0$ is the number of true null hypotheses.
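The rule above can be sketched in a few lines of R. The function name `bh_reject` and the example p-values are assumptions made for illustration. Note that the second-smallest p-value passes its threshold while the smallest does not, yet both are rejected, because the rule rejects everything up to the largest passing rank. The last line cross-checks the sketch against base R's `p.adjust` with `method = "BH"`.

```r
# Minimal sketch of the B-H step-up rule; `bh_reject` is a hypothetical helper name.
bh_reject = function(p, alpha = 0.05) {
  m = length(p)
  ord = order(p)                                    # indices sorting p-values low to high
  below = which(p[ord] <= alpha * seq_len(m) / m)   # ranks k with p_k <= (k / m) * alpha
  reject = rep(FALSE, m)
  if (length(below) > 0) {
    k = max(below)                                  # largest such rank
    reject[ord[seq_len(k)]] = TRUE                  # reject H_1, ..., H_k
  }
  reject                                            # in the original order of p
}

# Made-up p-values, deliberately unsorted
p = c(0.300, 0.012, 0.700, 0.011, 0.020)
bh_reject(p, alpha = 0.05)

# Cross-check: rejecting where the B-H adjusted p-value is at most alpha
all(bh_reject(p, alpha = 0.05) == (p.adjust(p, method = "BH") <= 0.05))
```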

Visualization:

library(ggplot2)
library(dplyr)
library(gridExtra)

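set.seed(111)
alpha = 0.05
# 1000 p-values from true nulls: Uniform(0, 1)
a = runif(1000)
# 500 p-values from true alternatives: concentrated near zero
b = rexp(500, rate = 60)
b = b[b < 1]
# `reject` marks p-values drawn from the alternative, i.e. nulls that should be rejected
data = data.frame(value = c(a, b), reject = c(rep(FALSE, length(a)), rep(TRUE, length(b))))
data = data[order(data$value), ]
data$x = 1:nrow(data)  # rank of each p-value

# B-H cutoff: largest rank k with p_k <= alpha * k / m
BH_max = max(which(data$value <= (alpha * data$x / nrow(data))))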

# Histogram of all p-values, colored by whether the alternative is true
p1 = ggplot(data) + geom_histogram(aes(x = value, fill = reject), color = 'white', bins = 30) +
  labs(title = 'p values') + theme_bw()

# Zoom in on the 200 smallest p-values; the dashed line is the B-H threshold alpha * k / m
p2 = ggplot(data) +
  geom_point(aes(x = x, y = value, color = reject)) +
  geom_abline(intercept = 0, slope = alpha/nrow(data), linetype = 'dashed') +
  geom_vline(xintercept = BH_max) +
  coord_cartesian(xlim = c(0, 200), ylim = c(0, 0.01)) +
  labs(title = '0 ~ 200 smallest p values', y = 'p value') + theme_bw()

# Same plot over the full range of ranks
p3 = ggplot(data) +
  geom_point(aes(x = x, y = value, color = reject)) +
  geom_abline(intercept = 0, slope = alpha/nrow(data), linetype = 'dashed') +
  geom_vline(xintercept = BH_max) +
  labs(title = 'All p values', y = 'p value') + theme_bw()

grid.arrange(p1, p2, p3, nrow = 1)
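Because the simulation knows which p-values come from true alternatives, the B-H decisions can be tabulated against the truth. The sketch below rebuilds the same simulated p-values and uses base R's `p.adjust` for the decisions; the variable names `is_alt`, `rejected` and `counts` are introduced here for illustration only.

```r
# Rebuild the simulated p-values from the visualization (same seed, same recipe)
set.seed(111)
alpha = 0.05
a = runif(1000)                 # true nulls
b = rexp(500, rate = 60)        # true alternatives
b = b[b < 1]
p = c(a, b)
is_alt = c(rep(FALSE, length(a)), rep(TRUE, length(b)))

# B-H decisions: reject where the B-H adjusted p-value is at most alpha
rejected = p.adjust(p, method = "BH") <= alpha

# Decisions against the truth, in the layout of the table in the first section
counts = table(decision = ifelse(rejected, "Reject", "Not Reject"),
               truth = ifelse(is_alt, "Alternative", "Null"))
print(counts)

# Realized false discovery proportion for this one data set
fp = sum(rejected & !is_alt)
tp = sum(rejected & is_alt)
fdp = if (fp + tp > 0) fp / (fp + tp) else 0
fdp
```

This is the proportion for a single draw, so it can land above or below $\alpha$; the guarantee concerns its average over repeated experiments.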

Why Does the B-H Procedure Work?

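The basic idea is that p-values of true null hypotheses follow a Uniform(0, 1) distribution, while p-values of true alternative hypotheses are right-skewed (concentrated near zero). In other words, the pdf of a true null's p-value is flat, while the pdf of a true alternative's p-value is decreasing on $[0, 1]$. Roughly, if every hypothesis with p-value at most $t$ is rejected, about $m_0 t$ of the rejections are expected to be true nulls. With $k$ rejections and $t = \frac{k}{m} \alpha$, the expected share of false rejections is about $\frac{m_0 t}{k} = \frac{m_0}{m} \alpha$.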
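The two distributions can be checked with a quick simulation. The sketch below is one possible setup, chosen here for illustration: one-sample t-tests on normal samples of size 20, with true mean 0 for the nulls and 0.5 for the alternatives. The sample size, effect size and variable names are assumptions, not part of the original example.

```r
set.seed(1)
n_sim = 2000

# One-sample t-tests of H0: mean = 0
p_null = replicate(n_sim, t.test(rnorm(20, mean = 0))$p.value)     # H0 is true
p_alt  = replicate(n_sim, t.test(rnorm(20, mean = 0.5))$p.value)   # H0 is false

# Under a true null, P(p <= t) is close to t for every t (flat density)
mean(p_null <= 0.05)
mean(p_null <= 0.50)

# Under a true alternative, small p-values are far more common
mean(p_alt <= 0.05)

# Counts in five equal-width bins: roughly even for nulls, front-loaded for alternatives
table(cut(p_null, breaks = seq(0, 1, by = 0.2)))
table(cut(p_alt,  breaks = seq(0, 1, by = 0.2)))
```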

(The B-H procedure controls the False Discovery Rate at the cost of failing to reject many true alternatives, i.e. false negatives.)
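Both sides of this trade-off can be estimated by repeating the simulation many times. The sketch below reuses the recipe from the visualization (1000 uniform nulls, 500 exponential alternatives with rate 60), except that alternative p-values are capped at 1 with `pmin` rather than filtered, so the number of alternatives stays fixed. The number of repetitions and the names `one_run` and `res` are assumptions made for illustration.

```r
set.seed(2024)
alpha = 0.05
m0 = 1000      # number of true nulls
m1 = 500       # number of true alternatives
n_rep = 500    # number of simulated experiments

# One simulated experiment: returns the false discovery proportion and the
# fraction of true alternatives that were rejected
one_run = function() {
  p = c(runif(m0), pmin(rexp(m1, rate = 60), 1))
  is_alt = rep(c(FALSE, TRUE), times = c(m0, m1))
  rejected = p.adjust(p, method = "BH") <= alpha
  n_rej = sum(rejected)
  c(fdp = if (n_rej > 0) sum(rejected & !is_alt) / n_rej else 0,
    power = sum(rejected & is_alt) / m1)
}

res = replicate(n_rep, one_run())   # 2 x n_rep matrix with rows "fdp" and "power"
rowMeans(res)                       # estimated FDR and average fraction of alternatives found
alpha * m0 / (m0 + m1)              # the bound (m0 / m) * alpha for comparison
```

The average of `fdp` estimates the FDR, and one minus the average of `power` estimates the share of true alternatives that go undetected.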

See this video for intuition. For a detailed proof, see Benjamini & Hochberg, 1995.