Shannon’s Entropy:
$H(X) = -\sum_{x \in X} p(x) \log p(x) = E[-\log p(X)]$
(or $H(X) = -\int_{x \in X} p(x) \log p(x) dx$ for a continuous distribution)
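As a quick sanity check on the discrete formula, here is a minimal Python sketch (standard library only; the probabilities are made up for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A biased coin carries less information on average than a fair coin
print(entropy([0.9, 0.1]))   # ~0.325 nats
print(entropy([0.5, 0.5]))   # log 2 ~ 0.693 nats
```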
An intuitive explanation:
For independent observations $x, y$, their joint density is $p(x,y) = p(x)p(y)$.
We want to define the “information” $h(x)$ of an observation with the following properties:
- Monotonic w.r.t. $p(x)$: the lower the density $p(x)$ is, the more information we get (rare outcomes are more surprising)
- Always positive
- Additivity: the information gained from $(x, y)$ is the sum of the information gained from $x$ and $y$, i.e. $h(x, y) = h(x) + h(y)$
A natural choice is $h(x) = -\log p(x)$. The entropy of the whole distribution is then the expected (probability-weighted) amount of information: $H(X) = E[-\log p(X)] = -\sum_{x \in X} p(x) \log p(x)$
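To see the additivity property concretely, here is a small sketch (plain Python; the probabilities are arbitrary) checking that $h(x, y) = h(x) + h(y)$ for independent outcomes when $h(x) = -\log p(x)$:

```python
import math

def info(p):
    """Self-information h(x) = -log p(x) of an outcome with probability p."""
    return -math.log(p)

# Two independent outcomes with made-up probabilities
p_x, p_y = 0.2, 0.5
p_joint = p_x * p_y                # independence: p(x, y) = p(x) p(y)

print(info(p_joint))               # h(x, y) ~ 2.303
print(info(p_x) + info(p_y))       # h(x) + h(y) ~ 2.303, the same value
```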
Properties
For a discrete distribution taking $K$ distinct values, $H(X) \leq \log(K)$, with equality iff all $K$ values have equal probability $\frac{1}{K}$.
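A small numerical illustration of this bound, reusing the same entropy helper as above (the non-uniform probabilities are chosen arbitrarily):

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

K = 4
uniform = [1.0 / K] * K
skewed = [0.7, 0.1, 0.1, 0.1]              # an arbitrary non-uniform example

print(entropy(uniform), math.log(K))       # both ~1.386: equality at the uniform distribution
print(entropy(skewed))                     # ~0.940 < log K
```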
KL Divergence
Kullback–Leibler divergence is used to measure the relative entropy (a notion of “distance”) between two distributions $p(\cdot), q(\cdot)$:
$KL(p || q) = - \int p(x) \log q(x) dx - [- \int p(x) \log p(x) dx] = -\int p(x) \log \frac{q(x)}{p(x)} dx$
The first part is the “entropy” of $q(\cdot)$ but computed with $p(\cdot)$ as the density (i.e. the cross-entropy), and the second part is the entropy of $p(\cdot)$. That is why it is called relative entropy.
Note:
- $KL(p || q) \geq 0$, with $KL(p || q) = 0$ iff $q(x) = p(x)$ with probability 1
- In general, $KL(p||q) \neq KL(q || p)$
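Both notes can be checked numerically. Here is a minimal sketch for discrete distributions (the two distributions below are made up for illustration):

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two made-up distributions over three outcomes
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q))   # ~0.232, non-negative
print(kl(q, p))   # ~0.315, not equal to KL(p || q)
print(kl(p, p))   # 0.0: the divergence vanishes iff the distributions coincide
```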