What is False Discovery Rate (FDR)
| Decision | Null is true | Alternative is true |
|---|---|---|
| Not rejected | TN | FN |
| Rejected | FP | TP |
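$FDR = E\left[\frac{FP}{FP + TP}\right]$, where the ratio is taken to be 0 when $FP + TP = 0$.
The rejected hypotheses, $FP + TP$ in total, are called “discoveries”, so FDR is the expected proportion of false positives among all discoveries.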
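As a small illustration of the definition, the sketch below computes the realized false discovery proportion for a single batch of tests. The function name `false_discovery_proportion` and the two toy vectors are made up for this example; in practice the truth is unknown, so this quantity can only be computed in simulations. FDR is the expectation of this proportion over repeated experiments.

```r
# null_is_true: TRUE if the null hypothesis actually holds (known only in simulation)
# rejected:     TRUE if the test rejected its null hypothesis
false_discovery_proportion = function(null_is_true, rejected) {
  discoveries = sum(rejected)            # FP + TP
  if (discoveries == 0) return(0)        # convention: 0 when nothing is rejected
  sum(rejected & null_is_true) / discoveries  # FP / (FP + TP)
}

# Toy data (hypothetical): 8 tests, 4 of them rejected
null_is_true = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
rejected     = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

fdp = false_discovery_proportion(null_is_true, rejected)
print(fdp)  # 1 false positive out of 4 discoveries: 0.25
```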
How to Control FDR (the B-H procedure)
Suppose a total of $m$ tests are conducted. Rank them by p-value from low to high, and denote the ordered hypotheses as $H_1, ..., H_m$ with p-values $p_1 \leq ... \leq p_m$. Let $\alpha$ be the target FDR level.
Find the largest $k$ such that $p_k \leq \frac{k}{m} \alpha$, then reject all hypotheses $H_1, ..., H_k$ (if no such $k$ exists, reject nothing). When the tests are independent, this guarantees $FDR \leq \frac{m_0}{m} \alpha \leq \alpha$, where $m_0$ is the number of true null hypotheses.
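The procedure above can be sketched in a few lines of R. The helper `bh_reject` and the five p-values are assumptions made for illustration; base R's `p.adjust(p, method = "BH")` is used only as a cross-check. Note that the second-smallest p-value (0.021) lies above its own threshold (0.02) but is still rejected, because a larger p-value (0.029) falls below its threshold (0.03).

```r
# Hypothetical helper: returns TRUE for each hypothesis rejected by B-H
bh_reject = function(p, alpha = 0.05) {
  m = length(p)
  ord = order(p)                                   # indices sorting p from low to high
  below = which(p[ord] <= (seq_len(m) / m) * alpha) # ranks that meet p_k <= (k / m) * alpha
  reject = rep(FALSE, m)
  if (length(below) > 0) {
    k = max(below)                                 # largest rank meeting the threshold
    reject[ord[seq_len(k)]] = TRUE                 # reject H_1, ..., H_k
  }
  reject
}

# Toy p-values (made up), deliberately unsorted
p = c(0.5, 0.029, 0.001, 0.2, 0.021)

manual  = bh_reject(p, alpha = 0.05)
builtin = p.adjust(p, method = "BH") <= 0.05       # B-H adjusted p-values from base R

print(manual)                   # FALSE TRUE TRUE FALSE TRUE
print(identical(manual, builtin))
```

Working with adjusted p-values, as `p.adjust` does, is usually more convenient in practice, since the same adjusted values can be compared against any chosen $\alpha$.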
Visualization:
library(ggplot2)
library(gridExtra)

set.seed(111)
alpha = 0.05

# Simulated p-values: 1000 true nulls (uniform) and 500 true alternatives (concentrated near 0)
a = runif(1000)
b = rexp(500, rate = 60)
b = b[b < 1]  # keep only valid p-values

# `reject` is the ground truth: TRUE marks a true alternative, i.e. a test that should be rejected
data = data.frame(value = c(a, b), reject = c(rep(FALSE, length(a)), rep(TRUE, length(b))))
data = data[order(data$value), ]
data$x = 1:nrow(data)  # rank of each p-value

# B-H cutoff: largest rank k with p_k <= (k / m) * alpha
BH_max = max(which(data$value <= (alpha * data$x / nrow(data))))

# Histogram of all p-values, colored by ground truth
p1 = ggplot(data) + geom_histogram(aes(x = value, fill = reject), color = 'white', bins = 30) +
  labs(title = 'p values') + theme_bw()

# Zoom on the 200 smallest p-values; dashed line is the B-H threshold, vertical line is the cutoff rank
p2 = ggplot(data) +
  geom_point(aes(x = x, y = value, color = reject)) +
  geom_abline(intercept = 0, slope = alpha / nrow(data), linetype = 'dashed') +
  geom_vline(xintercept = BH_max) +
  ylim(0, 0.01) + xlim(0, 200) +
  labs(title = '0 ~ 200 smallest p values', y = 'p value') + theme_bw()

# Same plot over the full range of ranks
p3 = ggplot(data) +
  geom_point(aes(x = x, y = value, color = reject)) +
  geom_abline(intercept = 0, slope = alpha / nrow(data), linetype = 'dashed') +
  geom_vline(xintercept = BH_max) +
  labs(title = 'All p values', y = 'p value') + theme_bw()

grid.arrange(p1, p2, p3, nrow = 1)
Why Does the B-H Procedure Work?
The basic idea is that p-values of true null hypotheses follow a Uniform(0, 1) distribution, while p-values of true alternative hypotheses are right-skewed (concentrated near zero). In other words, the density of null p-values is flat on $[0, 1]$, while the density of alternative p-values is decreasing on $[0, 1]$. As a result, the smallest p-values tend to come from true alternatives.
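The contrast between the two distributions can be checked with a quick simulation. This is a minimal sketch under assumed conditions: one-sided z-tests whose statistic is $N(0, 1)$ under the null and $N(2, 1)$ under the alternative. The effect size of 2 and the variable names are arbitrary choices for illustration.

```r
set.seed(1)
n_tests = 5000

# Test statistics under the null and under an assumed alternative (mean shifted to 2)
z_null = rnorm(n_tests, mean = 0)
z_alt  = rnorm(n_tests, mean = 2)

# One-sided p-values: P(Z > z) under the null distribution
p_null = pnorm(z_null, lower.tail = FALSE)
p_alt  = pnorm(z_alt,  lower.tail = FALSE)

# Share of p-values falling in each tenth of [0, 1]
bins = seq(0, 1, by = 0.1)
share_null = table(cut(p_null, bins, include.lowest = TRUE)) / n_tests
share_alt  = table(cut(p_alt,  bins, include.lowest = TRUE)) / n_tests

print(round(share_null, 2))  # roughly equal shares: flat density
print(round(share_alt, 2))   # most of the mass in the first bin: decreasing density
```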
The B-H procedure controls the false discovery rate at the cost of failing to reject many true alternatives (false negatives).
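This trade-off can be seen in a simulation that repeats the whole experiment many times and averages the results. The sketch below assumes 900 true nulls, 100 true alternatives with one-sided z-statistics centered at 3, and 500 repetitions; all of these numbers are illustrative choices. It uses base R's `p.adjust` for the B-H step.

```r
set.seed(42)
alpha = 0.05
m0 = 900     # number of true nulls (assumed)
m1 = 100     # number of true alternatives (assumed)
n_rep = 500  # number of simulated experiments

one_run = function() {
  # Null p-values are uniform; alternative p-values come from shifted z-statistics
  p = c(runif(m0), pnorm(rnorm(m1, mean = 3), lower.tail = FALSE))
  is_null = c(rep(TRUE, m0), rep(FALSE, m1))
  rejected = p.adjust(p, method = "BH") <= alpha
  fp = sum(rejected & is_null)
  tp = sum(rejected & !is_null)
  c(fdp = if (fp + tp > 0) fp / (fp + tp) else 0,  # false discovery proportion
    power = tp / m1)                               # share of true alternatives found
}

results = replicate(n_rep, one_run())  # 2 x n_rep matrix with rows "fdp" and "power"
fdr_hat = mean(results["fdp", ])
power_hat = mean(results["power", ])

print(fdr_hat)    # should be close to (m0 / (m0 + m1)) * alpha = 0.045
print(power_hat)  # noticeably below 1: some true alternatives are missed
```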
See this video for intuition. For a detailed proof, see Benjamini & Hochberg (1995).