Shannon’s Entropy:
$H(X) = -\sum_{x \in X} p(x) \log p(x) = E[-\log p(X)]$
(or $H(X) = -\int_{x \in X} p(x) \log p(x) dx$ for a continuous distribution)
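As a quick sanity check on the discrete formula, here is a minimal Python sketch (standard library only; the probabilities are made up for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A biased coin carries less information on average than a fair coin
print(entropy([0.9, 0.1]))   # ~0.325 nats
print(entropy([0.5, 0.5]))   # log 2 ~ 0.693 nats
```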
An intuitive explanation:
For independent observations $x, y$, their joint density is $p(x,y) = p(x)p(y)$.
We want to define the “information” $h(x)$ of an observation with the following properties:
- Monotonic w.r.t. $p(x)$: the lower the density $p(x)$ is, the more information we get (rare outcomes are more surprising)
- Always positive
- Additivity: the information gained from $(x, y)$ is the sum of the information gained from $x$ and $y$, i.e. $h(x, y) = h(x) + h(y)$
A natural choice is $h(x) = -\log p(x)$. The entropy of the whole distribution is then the expected (probability-weighted) amount of information: $H(X) = E[-\log p(X)] = -\sum_{x \in X} p(x) \log p(x)$
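To see the additivity property concretely, here is a small sketch (plain Python; the probabilities are arbitrary) checking that $h(x, y) = h(x) + h(y)$ for independent outcomes when $h(x) = -\log p(x)$:

```python
import math

def info(p):
    """Self-information h(x) = -log p(x) of an outcome with probability p."""
    return -math.log(p)

# Two independent outcomes with made-up probabilities
p_x, p_y = 0.2, 0.5
p_joint = p_x * p_y                # independence: p(x, y) = p(x) p(y)

print(info(p_joint))               # h(x, y) ~ 2.303
print(info(p_x) + info(p_y))       # h(x) + h(y) ~ 2.303, the same value
```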
Properties
For a discrete distribution taking $K$ distinct values, $H(X) \leq \log(K)$, with equality iff all $K$ values have equal probability $\frac{1}{K}$.
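A small numerical illustration of this bound, reusing the same entropy helper as above (the non-uniform probabilities are chosen arbitrarily):

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

K = 4
uniform = [1.0 / K] * K
skewed = [0.7, 0.1, 0.1, 0.1]              # an arbitrary non-uniform example

print(entropy(uniform), math.log(K))       # both ~1.386: equality at the uniform distribution
print(entropy(skewed))                     # ~0.940 < log K
```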
KL Divergence
Kullback–Leibler divergence is used to measure the relative entropy (a notion of “distance”) between two distributions $p(\cdot), q(\cdot)$:
$KL(p || q) = - \int p(x) \log q(x) dx - [- \int p(x) \log p(x) dx] = -\int p(x) \log \frac{q(x)}{p(x)} dx$
The first part is the “entropy” of $q(\cdot)$ but computed with $p(\cdot)$ as the density (i.e. the cross-entropy), and the second part is the entropy of $p(\cdot)$. That is why it is called relative entropy.
Note:
- $KL(p || q) \geq 0$, with $KL(p || q) = 0$ iff $q(x) = p(x)$ with probability 1
- In general, $KL(p||q) \neq KL(q || p)$
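Both notes can be checked numerically. Here is a minimal sketch for discrete distributions (the two distributions below are made up for illustration):

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two made-up distributions over three outcomes
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q))   # ~0.232, non-negative
print(kl(q, p))   # ~0.315, not equal to KL(p || q)
print(kl(p, p))   # 0.0: the divergence vanishes iff the distributions coincide
```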