Shannon’s Entropy:
$H(X) = -\sum_{x \in X} p(x) \log p(x) = E[-\log p(X)]$
(or $H(X) = -\int_{x \in X} p(x) \log p(x) \, dx$ for a continuous distribution)
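As a quick sanity check of the discrete formula, here is a minimal Python sketch; it uses natural logarithms (entropy in nats), and the `entropy` helper and the example probabilities are only for illustration:

```python
import numpy as np

# Entropy of a discrete distribution: H(X) = -sum_x p(x) log p(x)
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p(x) = 0 contribute 0 by convention
    return -np.sum(p * np.log(p))

print(entropy([0.9, 0.1]))            # ~0.325 nats for a heavily biased coin
print(entropy([0.5, 0.5]))            # log 2 ~ 0.693 nats for a fair coin (the maximum for 2 outcomes)
```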
An intuitive explanation:
For independent observations $x, y$, their joint density is $p(x,y) = p(x)p(y)$.
We want to obtain their “information” $h(x)$ with the following properties:
- Monotonic w.r.t. $p(x)$: the higher the density, the less information we gain (rare events are more informative)
- Always positive
- Additivity: the information gained from $(x, y)$ is the sum of the information gained from $x$ and $y$, i.e. $h(x, y) = h(x) + h(y)$
A natural choice is $h(x) = - \log p(x)$. Thus the entropy for the entire population is the weighted average amount of information: $H(X) = E[-\log p(X)] = -\sum_{x \in X} p(x) \log p(x)$
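A small numerical check that this choice satisfies the additivity property; the values of $p(x)$ and $p(y)$ below are arbitrary illustrative numbers:

```python
import numpy as np

# For independent x, y the joint density factorizes: p(x, y) = p(x) p(y),
# so h(x, y) = -log[p(x) p(y)] = -log p(x) - log p(y) = h(x) + h(y).
h = lambda prob: -np.log(prob)
px, py = 0.2, 0.7                                # illustrative probabilities
print(np.isclose(h(px * py), h(px) + h(py)))     # True: additivity holds
```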
Properties
For a discrete distribution taking $K$ distinct values, $H(X) \leq \log K$, with equality iff all $K$ values have equal probability $\frac{1}{K}$
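A quick empirical check of this bound, comparing random distributions over $K$ values against the uniform one; the Dirichlet sampling and $K = 4$ are just an illustrative setup:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                 # 0 log 0 = 0 by convention
    return -np.sum(p * np.log(p))

K = 4
rng = np.random.default_rng(0)
# Random distributions over K values never exceed the uniform entropy log K.
for _ in range(5):
    p = rng.dirichlet(np.ones(K))
    assert entropy(p) <= np.log(K) + 1e-12
print(entropy(np.ones(K) / K), np.log(K))        # equal: the uniform distribution attains the bound
```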
KL Divergence
The Kullback-Leibler divergence measures the relative entropy (a distance-like quantity) between two distributions $p(\cdot), q(\cdot)$:
$KL(p || q) = - \int p(x) \log q(x) dx - [- \int p(x) \log p(x) dx] = -\int p(x) \log \frac{q(x)}{p(x)} dx$
The first term is the cross-entropy: the “entropy” of $q(\cdot)$ but weighted by $p(\cdot)$ as the density. The second term is the entropy of $p(\cdot)$. That is why it is called relative entropy.
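A small numerical illustration of this decomposition, i.e. $KL(p || q)$ equals the cross-entropy term minus the entropy of $p$; the two three-point distributions below are arbitrary examples:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p * np.log(q))      # first term: -sum_x p(x) log q(x)
entropy_p     = -np.sum(p * np.log(p))      # second term: entropy of p
kl            =  np.sum(p * np.log(p / q))  # KL(p || q) computed directly

print(np.isclose(kl, cross_entropy - entropy_p))   # True: KL = cross-entropy - entropy
```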
Note:
- $KL(p || q) = 0$ iff $q(x) = p(x)$ with probability 1
- In general, $KL(p||q) \neq KL(q || p)$
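A short sketch illustrating both notes numerically; the `kl` helper and the example distributions are illustrative, and it assumes $q(x) > 0$ wherever $p(x) > 0$:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
print(kl(p, q), kl(q, p))   # two different values: KL is not symmetric
print(kl(p, p))             # 0.0: KL(p || p) = 0
```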