Study notes: Information Measures and Divergences
These notes cover several families of information measures and divergences.
Core definitions
Definition 0 (Information Measures)
Let \(\mathcal{X}\) be a measurable space with underlying \(\sigma\)-algebra \(\mathcal{F}\), and let \(\mathcal{M}(\mathcal{X})\) be the associated family of probability measures (distributions) on \(\mathcal{X}\). An information measure is a mapping \(M: \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}\). In addition to information measures between probability measures, we define a pointwise information measure as a mapping \(M_P: \mathcal{M}(\mathcal{X})^2 \times \mathcal{X}^2 \to \mathbb{R}\).
We place no assumptions on the properties of information measures or pointwise information measures. However, we often assume \(M(P,P)=0\) whenever both arguments are the same distribution.
Definition 1 (Divergences)
A divergence \(D\) is an information measure that satisfies
- \(D(P\|Q)=0\) if and only if \(P=Q\) almost everywhere;
- \(D(P\|Q)\geq 0\) for any \(P \ll Q\) (i.e. \(P\) is absolutely continuous with respect to \(Q\)).
Let \(P\) and \(Q\) be two probability measures on a measurable space \((\mathcal{X}, \mathcal{F})\), with \(P \ll Q\). Several families of divergences are worth investigating.
Definition 2 (\(f\)-divergence)
Let \(f: (0,\infty)\to \mathbb{R}\) be a convex function with \(f(1)=0\). The \(f\)-divergence between \(P\) and \(Q\) is defined as
\[D_f(P \| Q) = \int_\mathcal{X} f\!\left(\frac{dP}{dQ}\right)dQ,\] where \(\frac{dP}{dQ}\) is the Radon-Nikodym derivative of \(P\) with respect to \(Q\).
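In the discrete case the integral reduces to a weighted sum over the support of \(Q\). The sketch below is only illustrative (the helper name `f_divergence` and the example distributions are arbitrary choices, not from any particular library); it evaluates \(D_f\) for a user-supplied generator \(f\):

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete D_f(P || Q) = sum_x q(x) f(p(x)/q(x)); assumes P << Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = q > 0                 # P << Q means p(x) = 0 outside this set
    t = p[support] / q[support]     # likelihood ratio dP/dQ on the support of Q
    return float(np.sum(q[support] * f(t)))

# KL divergence via its generator f(t) = t log t (p and q strictly positive here).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, lambda t: t * np.log(t)))
```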
Some examples of \(f\)-divergences
| Name | \(f(t)\) | Expression \(D_f(P \Vert Q)\) | Notes |
|---|---|---|---|
| Kullback–Leibler (KL) | \(f(t) = t \log t\) | \(\int \log\!\left(\tfrac{dP}{dQ}\right) \, dP\) | Most common divergence; limit of Rényi as \(\alpha \to 1\). |
| Reverse KL | \(f(t) = -\log t\) | \(\int -\log\!\left(\tfrac{dP}{dQ}\right) \, dQ\) | Asymmetric; used in variational inference. |
| Total Variation (TV) | \(f(t) = \tfrac{1}{2}\vert t-1\vert\) | \(\tfrac{1}{2} \int \vert dP - dQ\vert\) | Metric; equals \(\sup_A \vert P(A) - Q(A)\vert\), the best achievable advantage in distinguishing \(P\) from \(Q\). |
| χ²-divergence | \(f(t) = (t-1)^2\) | \(\int \tfrac{(dP-dQ)^2}{dQ}\) | Related to Pearson's χ² test statistic. |
| Squared Hellinger | \(f(t) = (\sqrt{t}-1)^2\) | \(2\!\left(1 - \int \sqrt{dP\, dQ}\right)\) | Square of the Hellinger distance (some authors include a factor \(\tfrac{1}{2}\) in the generator). |
| Jensen–Shannon (JS) | \(f(t) = \tfrac{1}{2}\!\left(t \log t - (t+1)\log\!\tfrac{t+1}{2}\right)\) | \(\tfrac{1}{2}\mathrm{KL}(P \Vert M)+\tfrac{1}{2}\mathrm{KL}(Q \Vert M), \; M=\tfrac{1}{2}(P+Q)\) | Symmetric, bounded by \(\log 2\). |
| Triangular discrimination | \(f(t) = \tfrac{(t-1)^2}{t+1}\) | \(\int \tfrac{(dP-dQ)^2}{dP+dQ}\) | Also called the Vincze–Le Cam divergence (up to a factor of 2); comparable to the squared Hellinger distance within constant factors. |
| Bhattacharyya | \(f(t) = 1 - \sqrt{t}\) | \(1 - \int \sqrt{dP\, dQ}\) | One minus the Bhattacharyya coefficient (half the squared Hellinger row above); the Bhattacharyya distance \(-\log \int \sqrt{dP\, dQ}\) is a monotone transform of it. Used in pattern recognition. |
| α-divergence (Amari) | \(f(t) = \tfrac{t^\alpha - \alpha(t-1) -1}{\alpha(\alpha-1)}, \; \alpha \neq 0,1\) | \(\tfrac{1}{\alpha(\alpha-1)}\!\left(\int (dP)^\alpha (dQ)^{1-\alpha} - 1\right)\) | Interpolates between KL (\(\alpha \to 1\)) and reverse KL (\(\alpha \to 0\)); connects to Rényi divergences. |
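As a sanity check on the table, the generators can be plugged into the discrete-case sum \(\sum_x q(x) f\!\left(\tfrac{p(x)}{q(x)}\right)\) and compared against the closed-form expressions. A minimal sketch with arbitrary, strictly positive example distributions:

```python
import numpy as np

# Generators f(t) for a few rows of the table above.
generators = {
    "KL":            lambda t: t * np.log(t),
    "reverse KL":    lambda t: -np.log(t),
    "TV":            lambda t: 0.5 * np.abs(t - 1.0),
    "chi-squared":   lambda t: (t - 1.0) ** 2,
    "sq. Hellinger": lambda t: (np.sqrt(t) - 1.0) ** 2,
}

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

for name, f in generators.items():
    print(f"{name:14s} D_f = {np.sum(q * f(p / q)):.4f}")

# Cross-check three rows against their closed forms.
assert np.isclose(np.sum(q * generators["TV"](p / q)), 0.5 * np.sum(np.abs(p - q)))
assert np.isclose(np.sum(q * generators["chi-squared"](p / q)), np.sum((p - q) ** 2 / q))
assert np.isclose(np.sum(q * generators["sq. Hellinger"](p / q)), 2 * (1 - np.sum(np.sqrt(p * q))))
```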
Properties of \(f\)-divergences
Non-negativity
\(D_f(P \| Q) \geq 0\) for all \(P \ll Q\), with equality if and only if \(P = Q\) whenever \(f\) is strictly convex at \(1\). This follows from Jensen's inequality: \(D_f(P \| Q) = \mathbb{E}_Q\!\left[f\!\left(\tfrac{dP}{dQ}\right)\right] \geq f\!\left(\mathbb{E}_Q\!\left[\tfrac{dP}{dQ}\right]\right) = f(1) = 0.\)
Data-Processing Inequality (DPI)
Let \(P_X\) and \(Q_X\) be two distributions on \(\mathcal{X}\), and let \(P_Y\) and \(Q_Y\) be the distributions on \(\mathcal{Y}\) induced by passing them through the same kernel \(P_{Y|X}\). That is, for every measurable set \(\Xi \subseteq \mathcal{Y}\), \(P_Y(\Xi) = \int_\mathcal{X} P_{Y|X}(x,\Xi) \, dP_X(x)\) and \(Q_Y(\Xi) = \int_\mathcal{X} P_{Y|X}(x,\Xi) \, dQ_X(x)\). Then, \(D_f(P_X \| Q_X) \;\;\geq\;\; D_f(P_Y \| Q_Y).\)
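A quick numerical illustration of the DPI in the finite case, using KL as the \(f\)-divergence (the alphabet sizes, Dirichlet sampling, and seed are arbitrary choices for this sketch): pushing \(P_X\) and \(Q_X\) through the same row-stochastic kernel can only shrink the divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # KL divergence for strictly positive finite distributions.
    return float(np.sum(p * np.log(p / q)))

p_x = rng.dirichlet(np.ones(4))              # P_X on a 4-letter alphabet
q_x = rng.dirichlet(np.ones(4))              # Q_X on the same alphabet
kernel = rng.dirichlet(np.ones(5), size=4)   # kernel[x, y] = P(Y = y | X = x); rows sum to 1

p_y = p_x @ kernel                           # marginal of Y under P
q_y = q_x @ kernel                           # marginal of Y under Q (same kernel)

assert kl(p_x, q_x) >= kl(p_y, q_y) - 1e-12  # DPI
print(kl(p_x, q_x), kl(p_y, q_y))
```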
Conditioning increases divergence
Let \(P_X\) be a distribution on \(\mathcal{X}\), and let \(P_Y\) and \(Q_Y\) be the two distributions on \(\mathcal{Y}\) induced by the kernels \(P_{Y|X}\) and \(Q_{Y|X}\), respectively, both driven by the same input \(P_X\). Define the conditional \(f\)-divergence as \(D_f(P_{Y|X} \| Q_{Y|X} \mid P_X) := \mathbb{E}_{P_X} \big[ D_f(P_{Y|X} \| Q_{Y|X}) \big].\) Then the following inequality holds: \(D_f(P_{Y|X} \| Q_{Y|X} \mid P_X) \;\;\geq\;\; D_f(P_Y \| Q_Y).\)
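Similarly, the conditioning inequality can be checked numerically in the finite case (again with KL and arbitrary random kernels for this sketch): the \(P_X\)-average of the per-\(x\) divergence between the two kernels dominates the divergence between the induced marginals.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p_x = rng.dirichlet(np.ones(4))                # input distribution P_X
kernel_p = rng.dirichlet(np.ones(5), size=4)   # P_{Y|X}
kernel_q = rng.dirichlet(np.ones(5), size=4)   # Q_{Y|X}

p_y = p_x @ kernel_p                           # marginal induced by P_{Y|X}
q_y = p_x @ kernel_q                           # marginal induced by Q_{Y|X}

# Conditional divergence: E_{X ~ P_X}[ KL(P_{Y|X=x} || Q_{Y|X=x}) ]
conditional = float(np.sum(p_x * np.array([kl(kernel_p[x], kernel_q[x]) for x in range(4)])))

assert conditional >= kl(p_y, q_y) - 1e-12
print(conditional, kl(p_y, q_y))
```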
In fact, any divergence functional that is nonnegative, satisfies the DPI, and depends on \(P\) and \(Q\) only through the Radon-Nikodym derivative \(\frac{dP}{dQ}\) must be an \(f\)-divergence.
Definition 3 (Rényi divergence)
Let \(\alpha\in (0,1)\cup (1, \infty)\) be an adjustable order parameter. The Rényi divergence of order \(\alpha\) is defined as
\[D_\alpha(P \| Q) = \frac{1}{\alpha-1}\log \int_\mathcal{X} \left(\frac{dP}{dQ}\right)^\alpha dQ.\]
Rényi divergence for common values of \(\alpha\)
| α | Expression | Notes / Connection |
|---|---|---|
| 0 | \(D_0(P \Vert Q) = -\log Q(\operatorname{supp}(P))\) | Depends only on the support of \(P\); called min-divergence. |
| 1/2 | \(D_{1/2}(P \Vert Q) = -2 \log \int \sqrt{dP \, dQ}\) | Related to the Hellinger / Bhattacharyya coefficient; symmetric. |
| 1 | \(\lim_{\alpha \to 1} D_\alpha(P \Vert Q) = D_{\mathrm{KL}}(P \Vert Q) = \int \log\!\left(\tfrac{dP}{dQ}\right) dP\) | Standard KL divergence. |
| 2 | \(D_2(P \Vert Q) = \log \int \frac{(dP)^2}{dQ}\) | Related to the χ²-divergence: \(D_2 = \log\!\left(1+\chi^2(P \Vert Q)\right)\); sometimes called collision divergence. |
| ∞ | \(D_\infty(P \Vert Q) = \log \operatorname{ess\,sup}_x \frac{dP}{dQ}(x)\) | Maximum (essential) ratio of densities; called max-divergence. |
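A small numerical sketch of the special cases in the table, for finite, strictly positive distributions (the example distributions and tolerances are arbitrary): \(\alpha \to 1\) recovers KL, \(\alpha = \tfrac{1}{2}\) gives twice the Bhattacharyya distance, \(\alpha = 2\) equals \(\log(1 + \chi^2)\), and large \(\alpha\) approaches the maximum log-ratio.

```python
import numpy as np

def renyi(p, q, alpha):
    """Discrete Rényi divergence D_alpha(P || Q), alpha != 1, strictly positive p and q."""
    return float(np.log(np.sum(q * (p / q) ** alpha)) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

kl = float(np.sum(p * np.log(p / q)))
chi2 = float(np.sum((p - q) ** 2 / q))

assert np.isclose(renyi(p, q, 1.0001), kl, atol=1e-3)                      # alpha -> 1: KL
assert np.isclose(renyi(p, q, 0.5), -2 * np.log(np.sum(np.sqrt(p * q))))   # alpha = 1/2
assert np.isclose(renyi(p, q, 2.0), np.log(1.0 + chi2))                    # alpha = 2: collision
assert np.isclose(renyi(p, q, 200.0), np.max(np.log(p / q)), atol=1e-2)    # alpha -> infinity: max
```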
