Study notes: Information Measures and Divergences

Study notes on different families of information measures and divergences.

Core definitions

Definition 0 (Information Measures)

Let \(\mathcal{X}\) be a measurable space with underlying \(\sigma\)-algebra \(\mathcal{F}\), and let \(\mathcal{M}(\mathcal{X})\) be the associated family of probability measures (distributions) on \(\mathcal{X}\). An information measure is a mapping \(M: \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}\). In addition to information measures between probability measures, we define a pointwise information measure as a mapping \(M_P: \mathcal{M}(\mathcal{X})^2 \times \mathcal{X}^2 \to \mathbb{R}\).

No properties are assumed of information measures or pointwise information measures in general. However, we often assume \(M(P,P)=0\), i.e., the measure vanishes when the two arguments are the same distribution.

Definition 1 (Divergences)

A divergence \(D\) is an information measure that satisfies \(D(P \| Q) \ge 0\) for all \(P, Q \in \mathcal{M}(\mathcal{X})\), with equality if and only if \(P = Q\).

Let \(P\) and \(Q\) be two probability measures on a measurable space \((\mathcal{X}, \mathcal{F})\), with \(P \ll Q\). There are several families of divergences worth investigating.

Definition 2 (\(f\)-divergence)

Let \(f: (0,\infty)\to \mathbb{R}\) be a convex function with \(f(1)=0\). The \(f\)-divergence between \(P\) and \(Q\) is defined as

\[D_f(P \| Q) = \int_\mathcal{X} f\!\left(\frac{dP}{dQ}\right) dQ,\]

where \(\frac{dP}{dQ}\) is the Radon-Nikodym derivative of \(P\) with respect to \(Q\).
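
As a concrete illustration of Definition 2, here is a minimal sketch of the \(f\)-divergence for discrete distributions on a finite alphabet, where the integral becomes a sum and \(\frac{dP}{dQ}\) is just a ratio of probability mass functions. The function name `f_divergence` and the use of NumPy/SciPy are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.special import xlogy  # xlogy(x, y) = x*log(y), with the convention 0*log(0) = 0

def f_divergence(p, q, f):
    """Sketch of D_f(P || Q) for discrete P, Q on a finite alphabet (assumes P << Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0                     # integrate over the support of Q
    ratio = p[mask] / q[mask]        # the Radon-Nikodym derivative dP/dQ
    return float(np.sum(q[mask] * f(ratio)))

# KL divergence as an f-divergence with f(t) = t log t.
kl_f = lambda t: xlogy(t, t)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, kl_f))      # same value as sum_x p(x) log(p(x)/q(x))
```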

Some examples of \(f\)-divergences

| Name | \(f(t)\) | Expression \(D_f(P \Vert Q)\) | Notes |
| --- | --- | --- | --- |
| Kullback–Leibler (KL) | \(t \log t\) | \(\int \log\!\left(\tfrac{dP}{dQ}\right) dP\) | Most common divergence; limit of Rényi as \(\alpha \to 1\). |
| Reverse KL | \(-\log t\) | \(\int -\log\!\left(\tfrac{dP}{dQ}\right) dQ = \mathrm{KL}(Q \Vert P)\) | Asymmetric; used in variational inference. |
| Total Variation (TV) | \(\tfrac{1}{2}\vert t-1\vert\) | \(\tfrac{1}{2} \int \vert dP - dQ\vert\) | Metric; equals \(\sup_{A \in \mathcal{F}} \vert P(A) - Q(A)\vert\), the maximum advantage in distinguishing \(P\) from \(Q\). |
| \(\chi^2\)-divergence | \((t-1)^2\) | \(\int \tfrac{(dP-dQ)^2}{dQ}\) | Related to Pearson's \(\chi^2\) test statistic. |
| Hellinger divergence | \((\sqrt{t}-1)^2\) | \(2\left(1 - \int \sqrt{dP\, dQ}\right)\) | Twice the squared Hellinger distance \(H^2(P,Q) = 1 - \int \sqrt{dP\,dQ}\). |
| Jensen–Shannon (JS) | \(\tfrac{t}{2} \log t - \tfrac{t+1}{2}\log\tfrac{t+1}{2}\) | \(\tfrac{1}{2}\mathrm{KL}(P \Vert M)+\tfrac{1}{2}\mathrm{KL}(Q \Vert M), \; M=\tfrac{1}{2}(P+Q)\) | Symmetric, bounded by \(\log 2\). |
| Triangular discrimination | \(\tfrac{(t-1)^2}{t+1}\) | \(\int \tfrac{(dP-dQ)^2}{dP+dQ}\) | Symmetric; also known as the Vincze–Le Cam divergence. |
| Squared Hellinger | \(1 - \sqrt{t}\) | \(1 - \int \sqrt{dP\, dQ}\) | Equals \(1 - \mathrm{BC}(P,Q)\), where \(\mathrm{BC}\) is the Bhattacharyya coefficient; the Bhattacharyya distance is \(-\log \mathrm{BC}(P,Q)\). Used in pattern recognition. |
| \(\alpha\)-divergence (Amari) | \(\tfrac{t^\alpha - \alpha(t-1) -1}{\alpha(\alpha-1)}, \; \alpha \neq 0,1\) | \(\tfrac{1}{\alpha(\alpha-1)}\left(\int (dP)^\alpha (dQ)^{1-\alpha} - 1\right)\) | Interpolates between KL (\(\alpha \to 1\)) and reverse KL (\(\alpha \to 0\)); connects to Rényi divergences. |
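
A quick numerical cross-check, under the same discrete setup as in the sketch above, that the \(f(t)\) column and the closed-form Expression column of the table agree for a few rows (KL, TV, \(\chi^2\), Hellinger); the specific distributions are arbitrary examples.

```python
import numpy as np
from scipy.special import xlogy

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
r = p / q                                       # likelihood ratio dP/dQ (here q > 0 everywhere)

# Generic form sum_x q(x) f(p(x)/q(x)) for four rows of the table.
kl        = np.sum(q * xlogy(r, r))             # f(t) = t log t
tv        = np.sum(q * 0.5 * np.abs(r - 1))     # f(t) = |t - 1| / 2
chi2      = np.sum(q * (r - 1) ** 2)            # f(t) = (t - 1)^2
hellinger = np.sum(q * (np.sqrt(r) - 1) ** 2)   # f(t) = (sqrt(t) - 1)^2

# The corresponding closed-form expressions from the table.
assert np.isclose(kl, np.sum(xlogy(p, p / q)))
assert np.isclose(tv, 0.5 * np.sum(np.abs(p - q)))
assert np.isclose(chi2, np.sum((p - q) ** 2 / q))
assert np.isclose(hellinger, 2 * (1 - np.sum(np.sqrt(p * q))))
print(kl, tv, chi2, hellinger)
```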

Properties of \(f\)-divergences

Key properties include nonnegativity (\(D_f(P \| Q) \ge 0\) by Jensen's inequality, with equality iff \(P = Q\) when \(f\) is strictly convex at \(1\)), the data-processing inequality (DPI: passing both \(P\) and \(Q\) through the same Markov kernel cannot increase \(D_f\)), and joint convexity in \((P, Q)\). In fact, any divergence functional that is nonnegative, satisfies the DPI, and depends only on the Radon-Nikodym derivative \(\frac{dP}{dQ}\) must be an \(f\)-divergence.
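
A small numerical illustration of the data-processing inequality for the KL divergence, assuming discrete distributions and a row-stochastic matrix as the Markov kernel; the random seed and dimensions are arbitrary choices.

```python
import numpy as np
from scipy.special import rel_entr  # rel_entr(x, y) = x*log(x/y), with 0*log(0/y) = 0

def kl(p, q):
    return float(np.sum(rel_entr(p, q)))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))                # random distribution P on 4 points
q = rng.dirichlet(np.ones(4))                # random distribution Q on 4 points
K = rng.dirichlet(np.ones(5), size=4)        # Markov kernel: each row is a conditional distribution

# Pushing both P and Q through the same channel K cannot increase the divergence.
assert kl(p @ K, q @ K) <= kl(p, q) + 1e-12
print(kl(p, q), kl(p @ K, q @ K))
```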

Definition 3 (Rényi divergence)

Let \(\alpha \in (0,1)\cup (1, \infty)\) be an adjustable order parameter. The Rényi divergence of order \(\alpha\) is defined as

\[D_\alpha(P \| Q) = \frac{1}{\alpha-1}\log \int_\mathcal{X} \left(\frac{dP}{dQ}\right)^\alpha dQ,\]

with the values at \(\alpha \in \{0, 1, \infty\}\) defined by taking limits.
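
A minimal sketch of Definition 3 for discrete distributions, again with the integral replaced by a sum; the function name `renyi_divergence` is an illustrative choice. Numerically, \(D_\alpha\) approaches the KL divergence as \(\alpha \to 1\), consistent with the table below.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Sketch of D_alpha(P || Q) for discrete P, Q with P << Q and alpha != 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                  # terms with p(x) = 0 contribute nothing
    s = np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))
    return float(np.log(s) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
kl = float(np.sum(p * np.log(p / q)))
print(renyi_divergence(p, q, 0.999), kl)          # close to KL as alpha -> 1
print(renyi_divergence(p, q, 2.0))                # order-2 (collision) divergence
```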

Rényi divergence for common values of \(\alpha\)

| \(\alpha\) | Expression | Notes / Connection |
| --- | --- | --- |
| \(0\) | \(D_0(P \Vert Q) = -\log Q(\operatorname{supp}(P))\) | Depends on \(P\) only through its support; called min-divergence. |
| \(1/2\) | \(D_{1/2}(P \Vert Q) = -2 \log \int \sqrt{dP \, dQ}\) | Twice the Bhattacharyya distance; related to the Hellinger distance; symmetric in \(P\) and \(Q\). |
| \(1\) | \(\lim_{\alpha \to 1} D_\alpha(P \Vert Q) = D_{\mathrm{KL}}(P \Vert Q) = \int \log\!\left(\tfrac{dP}{dQ}\right) dP\) | Standard KL divergence. |
| \(2\) | \(D_2(P \Vert Q) = \log \int \frac{(dP)^2}{dQ} = \log\!\left(1 + \chi^2(P \Vert Q)\right)\) | Related to the \(\chi^2\)-divergence; sometimes called collision divergence. |
| \(\infty\) | \(D_\infty(P \Vert Q) = \log \operatorname{ess\,sup}_x \frac{dP}{dQ}(x)\) | Maximum ratio of densities; called max-divergence. |
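
A numerical check of the table's special cases for discrete distributions, using a log-sum-exp form of the same definition for numerical stability; the example distributions (with \(\operatorname{supp}(P)\) a strict subset of \(\operatorname{supp}(Q)\)) are arbitrary.

```python
import numpy as np
from scipy.special import logsumexp

def renyi(p, q, alpha):
    # Same definition as above, computed in log space; restrict to the common support.
    mask = (p > 0) & (q > 0)
    return float(logsumexp(alpha * np.log(p[mask]) + (1 - alpha) * np.log(q[mask])) / (alpha - 1))

p = np.array([0.5, 0.3, 0.2, 0.0])   # P puts no mass on the last point
q = np.array([0.4, 0.3, 0.2, 0.1])   # Q does, so supp(P) is a strict subset of supp(Q)

# alpha = 1/2: twice the Bhattacharyya distance, symmetric in P and Q.
assert np.isclose(renyi(p, q, 0.5), -2 * np.log(np.sum(np.sqrt(p * q))))
assert np.isclose(renyi(p, q, 0.5), renyi(q, p, 0.5))

# alpha = 2: equals log(1 + chi^2(P || Q)).
chi2 = np.sum((p - q) ** 2 / q)
assert np.isclose(renyi(p, q, 2.0), np.log(1 + chi2))

# alpha -> 0 (min-divergence) and alpha -> infinity (max-divergence), taken as limits.
print(renyi(p, q, 1e-6), -np.log(np.sum(q[p > 0])))               # both near -log Q(supp(P))
print(renyi(p, q, 1e6), np.log(np.max(p[p > 0] / q[p > 0])))      # both near log ess sup dP/dQ
```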