What is False Discovery Rate (FDR)
| Multiple Tests | Null is true | Alternative is true |
|---|---|---|
| Not Reject | TN | FN |
| Reject | FP | TP |
$FDR = E\left[\frac{FP}{FP + TP}\right]$, where the ratio is taken to be 0 when $FP + TP = 0$. The $FP + TP$ rejected hypotheses are called “discoveries”, so FDR is the expected proportion of false positives among all discoveries.
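As a small illustration of the quantity inside the expectation, the sketch below computes the false discovery proportion for a single experiment from its counts. The function name `false_discovery_proportion` and the example counts are made up for illustration; FDR is the average of this proportion over repeated experiments.

```r
# Hypothetical helper: proportion of false positives among all discoveries.
# By convention the proportion is 0 when nothing is rejected.
false_discovery_proportion <- function(fp, tp) {
  discoveries <- fp + tp
  if (discoveries == 0) 0 else fp / discoveries
}

# Illustrative counts: 5 false positives among 50 discoveries.
false_discovery_proportion(fp = 5, tp = 45)

# No discoveries at all: the proportion is defined as 0 rather than NaN.
false_discovery_proportion(fp = 0, tp = 0)
```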
How to Control FDR (the B-H procedure)
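Suppose a total of $m$ tests are conducted. Rank them by p-value from low to high, and denote the ordered hypotheses as $H_1, ..., H_m$ and their p-values as $p_1, ..., p_m$. Suppose the desired FDR level is $\alpha$.
Find the largest $k$ with $p_k \leq \frac{k}{m} \alpha$ and reject all hypotheses $H_1, ..., H_k$; if no such $k$ exists, reject nothing. Every hypothesis up to $H_k$ is rejected, even one whose own p-value lies above its threshold. When the tests are independent, this guarantees $FDR \leq \frac{m_0}{m} \alpha \leq \alpha$, where $m_0$ is the number of true null hypotheses.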
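The steps above can be sketched in a few lines of R. The function name `bh_reject` and the example p-values are hypothetical; the result is compared against base R's `p.adjust(p, method = "BH")`, which gives the same decisions when its adjusted p-values are thresholded at $\alpha$.

```r
# Minimal sketch of the B-H step-up rule (bh_reject is an illustrative name).
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                                 # ranks p-values low to high
  below <- which(p[ord] <= seq_len(m) / m * alpha)
  reject <- logical(m)                            # default: reject nothing
  if (length(below) > 0) {
    k <- max(below)                               # largest k with p_k <= (k / m) * alpha
    reject[ord[seq_len(k)]] <- TRUE               # reject H_1, ..., H_k
  }
  reject
}

# Made-up p-values. The second one (0.012) is above its own threshold (0.010),
# but it is still rejected because the fourth one (0.019) is below 0.020.
p <- c(0.001, 0.012, 0.014, 0.019, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)

manual  <- bh_reject(p, alpha = 0.05)
builtin <- p.adjust(p, method = "BH") <= 0.05

which(manual)               # indices of the rejected hypotheses
identical(manual, builtin)  # the two approaches agree on this input
```

In practice `p.adjust` is the more convenient choice, since the adjusted p-values can be compared against any level without rerunning the procedure.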
Visualization:
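```r
library(ggplot2)
library(gridExtra)

set.seed(111)
alpha = 0.05

# Simulated p-values: true nulls are Uniform(0, 1); true alternatives pile up near zero.
a = runif(1000)
b = rexp(500, rate = 60)
b = b[b < 1]  # keep only valid p-values

# `reject` is the ground truth (TRUE = drawn from the alternative, so it should be
# rejected), not the decision made by the B-H procedure.
data = data.frame(value = c(a, b), reject = c(rep(FALSE, length(a)), rep(TRUE, length(b))))
data = data[order(data$value), ]
data$x = 1:nrow(data)  # rank of each p-value

# Largest rank k with p_k <= (k / m) * alpha; 0 if no p-value falls under the line.
below = which(data$value <= (alpha * data$x / nrow(data)))
BH_max = if (length(below) > 0) max(below) else 0

# Histogram of all p-values, colored by ground truth.
p1 = ggplot(data) + geom_histogram(aes(x = value, fill = reject), bins = 30, color = 'white') +
  labs(title = 'p values') + theme_bw()
# Zoom in on the smallest p-values; the dashed line is the B-H threshold (k / m) * alpha
# and the vertical line marks the cutoff rank.
p2 = ggplot(data) +
  geom_point(aes(x = x, y = value, color = reject)) +
  geom_abline(intercept = 0, slope = alpha / nrow(data), linetype = 'dashed') +
  geom_vline(xintercept = BH_max) +
  coord_cartesian(xlim = c(0, 200), ylim = c(0, 0.01)) +
  labs(title = '0 ~ 200 smallest p values', y = 'p value') + theme_bw()
# Same plot over the full range of ranks.
p3 = ggplot(data) +
  geom_point(aes(x = x, y = value, color = reject)) +
  geom_abline(intercept = 0, slope = alpha / nrow(data), linetype = 'dashed') +
  geom_vline(xintercept = BH_max) +
  labs(title = 'All p values', y = 'p value') + theme_bw()
grid.arrange(p1, p2, p3, nrow = 1)
```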
Why the B-H Procedure Works
The basic idea is that p-values of true null hypotheses follow a Uniform(0, 1) distribution, while p-values of true alternative hypotheses are right-skewed (dense around zero). In other words, the density of null p-values is flat, while the density of alternative p-values is typically decreasing on $[0, 1]$, so the smallest p-values come mostly from true alternatives.
The B-H procedure controls the False Discovery Rate at the cost of failing to reject many true alternative hypotheses (false negatives).
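To see this trade-off numerically, the sketch below repeats a simulation with the same mixture as in the visualization (Uniform(0, 1) nulls, exponential alternatives) and averages the false discovery proportion and the share of alternatives that go undetected. The number of repetitions, the helper `one_run`, and the use of `pmin(..., 1)` to cap alternative p-values at 1 are assumptions made for this example.

```r
set.seed(111)
alpha <- 0.05
m0 <- 1000   # number of true nulls
m1 <- 500    # number of true alternatives

# One simulated experiment: returns the false discovery proportion and
# the fraction of true alternatives that were not rejected.
one_run <- function() {
  p_null <- runif(m0)                       # flat density on [0, 1]
  p_alt  <- pmin(rexp(m1, rate = 60), 1)    # dense around zero, capped at 1
  p <- c(p_null, p_alt)
  is_null <- rep(c(TRUE, FALSE), c(m0, m1))
  rejected <- p.adjust(p, method = "BH") <= alpha
  fdp <- if (any(rejected)) sum(rejected & is_null) / sum(rejected) else 0
  missed <- sum(!rejected & !is_null) / m1
  c(fdp = fdp, missed = missed)
}

runs <- replicate(2000, one_run())   # 2 x 2000 matrix, one column per experiment

mean(runs["fdp", ])      # Monte Carlo estimate of the FDR
m0 / (m0 + m1) * alpha   # the bound (m0 / m) * alpha from the B-H guarantee
mean(runs["missed", ])   # average share of true alternatives left unrejected
```

Up to Monte Carlo noise, the estimated FDR should sit at or below the bound, while the share of missed alternatives shows what the control costs in power.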
See this video for intuition. For a detailed proof, see Benjamini & Hochberg, 1995.