
Entropy

Shannon’s Entropy:

$H(X) = -\sum_{x \in X} p(x) \log p(x) = E[-\log p(X)]$

(or $H(X) = -\int_{x \in X} p(x) \log p(x) \, dx$ for a continuous random variable with density $p$)
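For example, a fair coin with $p(\text{heads}) = p(\text{tails}) = \frac{1}{2}$ gives

$H(X) = -\left( \frac{1}{2} \log \frac{1}{2} + \frac{1}{2} \log \frac{1}{2} \right) = \log 2$

i.e. one bit of information per toss when the logarithm is taken base 2.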

An intuitive explanation:

For independent observations $x, y$, their joint density is $p(x,y) = p(x)p(y)$.

We want to define a measure of “information” $h(x)$ with the following properties:

  • Monotonic w.r.t. $p(x)$: the less probable an observation is (the lower $p(x)$), the more information we gain

  • Non-negative: $h(x) \geq 0$

  • Additivity: the information gained from observing the pair $(x, y)$ is the sum of the information gained from $x$ and from $y$, i.e. $h(x, y) = h(x) + h(y)$

A natural choice is $h(x) = - \log p(x)$: the logarithm turns the product $p(x, y) = p(x)p(y)$ into a sum, giving additivity, and since $p(x) \leq 1$ in the discrete case, $h(x) \geq 0$. The entropy of the entire population is then the weighted average (expected) amount of information: $H(X) = E[-\log p(X)] = -\sum_{x \in X} p(x) \log p(x)$
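As a quick sanity check of the discrete formula, here is a minimal NumPy sketch (the function name `entropy` and the example distributions are purely illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p(x) = 0 contribute nothing by convention
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))  # fair coin: log 2 ≈ 0.693
print(entropy([0.9, 0.1]))  # biased coin: ≈ 0.325, less information on average
```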

Properties

For a discrete distribution taking $K$ distinct values, $H(X) \leq \log K$, with equality if and only if all $K$ values have equal probability $\frac{1}{K}$.
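The equality case follows by plugging the uniform distribution into the definition:

$H(X) = -\sum_{k=1}^{K} \frac{1}{K} \log \frac{1}{K} = K \cdot \frac{1}{K} \log K = \log K$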

KL Divergence

The Kullback-Leibler divergence measures the relative entropy (an asymmetric notion of “distance”) between two distributions $p(\cdot), q(\cdot)$:

$KL(p || q) = - \int p(x) \log q(x) dx - [- \int p(x) \log p(x) dx] = -\int p(x) \log \frac{q(x)}{p(x)} dx$

The first term is the cross-entropy of $q(\cdot)$ under $p(\cdot)$ (the “entropy” of $q$, but averaged with $p$ as the density), and the second term is the entropy of $p(\cdot)$. That is why KL divergence is also called relative entropy.
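A minimal sketch of the discrete version (the helper name `kl_divergence` and the example distributions are illustrative; it assumes $q(x) > 0$ wherever $p(x) > 0$):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.51: q is a poor stand-in for p
```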

Note:

  • $KL(p || q) \geq 0$, with $KL(p || q) = 0$ iff $q(x) = p(x)$ with probability 1

  • In general, $KL(p||q) \neq KL(q || p)$, so KL divergence is not symmetric (a worked example follows below)
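For a concrete illustration of the asymmetry, take $p = (0.5, 0.5)$ and $q = (0.9, 0.1)$ (natural log):

$KL(p || q) = 0.5 \log \frac{0.5}{0.9} + 0.5 \log \frac{0.5}{0.1} \approx 0.51, \qquad KL(q || p) = 0.9 \log \frac{0.9}{0.5} + 0.1 \log \frac{0.1}{0.5} \approx 0.37$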