Study notes: Information Measures and Divergences
These notes cover several families of information measures and divergences.
Core definitions
Definition 0 (Information Measures)
Let \(\mathcal{X}\) be a measurable space with underlying \(\sigma\)-algebra \(\mathcal{F}\), and let \(\mathcal{M}(\mathcal{X})\) be the associated family of probability measures (distributions) on \(\mathcal{X}\). An information measure is a mapping \(M: \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}\). In addition to information measures between probability measures, we define a pointwise information measure as a mapping \(M_P: \mathcal{M}(\mathcal{X})^2 \times \mathcal{X}^2 \to \mathbb{R}\).
We place no assumptions on the properties of information measures or pointwise information measures. However, we often assume \(M(P,P)=0\) whenever both arguments are the same distribution.
Definition 1 (Divergences)
A divergence \(D\) is an information measure that satisfies
- \(D(P\|Q)=0\) if and only if \(P=Q\) almost everywhere;
- \(D(P\|Q)\geq 0\) for any \(P \ll Q\) (i.e. \(P\) is absolutely continuous with respect to \(Q\)).
Let \(P\) and \(Q\) be two probability measures on a measurable space \((\mathcal{X}, \mathcal{F})\), with \(P \ll Q\). Several families of divergences are worth investigating.
Definition 2 (\(f\)-divergence)
Let \(f: (0,\infty)\to \mathbb{R}\) be a convex function with \(f(1)=0\). The \(f\)-divergence between \(P\) and \(Q\) is defined as
\[D_f(P \| Q) = \int_\mathcal{X} f\!\left(\frac{dP}{dQ}\right)dQ,\] where \(\frac{dP}{dQ}\) is the Radon-Nikodym derivative of \(P\) with respect to \(Q\).
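In the discrete case the integral reduces to a weighted sum over the support of \(Q\). The sketch below is only illustrative (the helper name `f_divergence` and the example distributions are arbitrary choices, not from any particular library); it evaluates \(D_f\) for a user-supplied generator \(f\):

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete D_f(P || Q) = sum_x q(x) f(p(x)/q(x)); assumes P << Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = q > 0                 # P << Q means p(x) = 0 outside this set
    t = p[support] / q[support]     # likelihood ratio dP/dQ on the support of Q
    return float(np.sum(q[support] * f(t)))

# KL divergence via its generator f(t) = t log t (p and q strictly positive here).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, lambda t: t * np.log(t)))
```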
Some examples of \(f\)-divergences
| Name | \(f(t)\) | Expression \(D_f(P \Vert Q)\) | Notes |
|---|---|---|---|
| Kullback–Leibler (KL) | \(f(t) = t \log t\) | \(\int \log\!\left(\tfrac{dP}{dQ}\right) \, dP\) | Most common divergence; limit of Rényi as \(\alpha \to 1\). |
| Reverse KL | \(f(t) = -\log t\) | \(\int -\log\!\left(\tfrac{dP}{dQ}\right) \, dQ\) | Asymmetric; used in variational inference. |
| Total Variation (TV) | \(f(t) = \tfrac{1}{2}\vert t-1\vert\) | \(\tfrac{1}{2} \int \vert dP - dQ\vert\) | Metric; equals \(\sup_A \vert P(A) - Q(A)\vert\), the best achievable advantage in distinguishing \(P\) from \(Q\). |
| χ²-divergence | \(f(t) = (t-1)^2\) | \(\int \tfrac{(dP-dQ)^2}{dQ}\) | Related to Pearson's χ² test statistic. |
| Squared Hellinger | \(f(t) = (\sqrt{t}-1)^2\) | \(2\!\left(1 - \int \sqrt{dP\, dQ}\right)\) | Square of the Hellinger distance (some authors include a factor \(\tfrac{1}{2}\) in the generator). |
| Jensen–Shannon (JS) | \(f(t) = \tfrac{1}{2}\!\left(t \log t - (t+1)\log\!\tfrac{t+1}{2}\right)\) | \(\tfrac{1}{2}\mathrm{KL}(P \Vert M)+\tfrac{1}{2}\mathrm{KL}(Q \Vert M), \; M=\tfrac{1}{2}(P+Q)\) | Symmetric, bounded by \(\log 2\). |
| Triangular discrimination | \(f(t) = \tfrac{(t-1)^2}{t+1}\) | \(\int \tfrac{(dP-dQ)^2}{dP+dQ}\) | Also called the Vincze–Le Cam divergence (up to a factor of 2); comparable to the squared Hellinger distance within constant factors. |
| Bhattacharyya | \(f(t) = 1 - \sqrt{t}\) | \(1 - \int \sqrt{dP\, dQ}\) | One minus the Bhattacharyya coefficient (half the squared Hellinger row above); the Bhattacharyya distance \(-\log \int \sqrt{dP\, dQ}\) is a monotone transform of it. Used in pattern recognition. |
| α-divergence (Amari) | \(f(t) = \tfrac{t^\alpha - \alpha(t-1) -1}{\alpha(\alpha-1)}, \; \alpha \neq 0,1\) | \(\tfrac{1}{\alpha(\alpha-1)}\!\left(\int (dP)^\alpha (dQ)^{1-\alpha} - 1\right)\) | Interpolates between KL (\(\alpha \to 1\)) and reverse KL (\(\alpha \to 0\)); connects to Rényi divergences. |
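As a sanity check on the table, the generators can be plugged into the discrete-case sum \(\sum_x q(x) f\!\left(\tfrac{p(x)}{q(x)}\right)\) and compared against the closed-form expressions. A minimal sketch with arbitrary, strictly positive example distributions:

```python
import numpy as np

# Generators f(t) for a few rows of the table above.
generators = {
    "KL":            lambda t: t * np.log(t),
    "reverse KL":    lambda t: -np.log(t),
    "TV":            lambda t: 0.5 * np.abs(t - 1.0),
    "chi-squared":   lambda t: (t - 1.0) ** 2,
    "sq. Hellinger": lambda t: (np.sqrt(t) - 1.0) ** 2,
}

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

for name, f in generators.items():
    print(f"{name:14s} D_f = {np.sum(q * f(p / q)):.4f}")

# Cross-check three rows against their closed forms.
assert np.isclose(np.sum(q * generators["TV"](p / q)), 0.5 * np.sum(np.abs(p - q)))
assert np.isclose(np.sum(q * generators["chi-squared"](p / q)), np.sum((p - q) ** 2 / q))
assert np.isclose(np.sum(q * generators["sq. Hellinger"](p / q)), 2 * (1 - np.sum(np.sqrt(p * q))))
```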
Properties of \(f\)-divergences
Non-negativity
\(D_f(P \| Q) \geq 0\) for all \(P \ll Q\), with equality if and only if \(P = Q\) whenever \(f\) is strictly convex at \(1\). This follows from Jensen's inequality: \(D_f(P \| Q) = \mathbb{E}_Q\!\left[f\!\left(\tfrac{dP}{dQ}\right)\right] \geq f\!\left(\mathbb{E}_Q\!\left[\tfrac{dP}{dQ}\right]\right) = f(1) = 0.\)
Data-Processing Inequality (DPI)
Let \(P_X\) and \(Q_X\) be two distributions on \(\mathcal{X}\), and let \(P_Y\) and \(Q_Y\) be the distributions on \(\mathcal{Y}\) induced by passing them through the same kernel \(P_{Y|X}\). That is, for every measurable set \(\Xi \subseteq \mathcal{Y}\), \(P_Y(\Xi) = \int_\mathcal{X} P_{Y|X}(x,\Xi) \, dP_X(x)\) and \(Q_Y(\Xi) = \int_\mathcal{X} P_{Y|X}(x,\Xi) \, dQ_X(x)\). Then, \(D_f(P_X \| Q_X) \;\;\geq\;\; D_f(P_Y \| Q_Y).\)
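A quick numerical illustration of the DPI in the finite case, using KL as the \(f\)-divergence (the alphabet sizes, Dirichlet sampling, and seed are arbitrary choices for this sketch): pushing \(P_X\) and \(Q_X\) through the same row-stochastic kernel can only shrink the divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # KL divergence for strictly positive finite distributions.
    return float(np.sum(p * np.log(p / q)))

p_x = rng.dirichlet(np.ones(4))              # P_X on a 4-letter alphabet
q_x = rng.dirichlet(np.ones(4))              # Q_X on the same alphabet
kernel = rng.dirichlet(np.ones(5), size=4)   # kernel[x, y] = P(Y = y | X = x); rows sum to 1

p_y = p_x @ kernel                           # marginal of Y under P
q_y = q_x @ kernel                           # marginal of Y under Q (same kernel)

assert kl(p_x, q_x) >= kl(p_y, q_y) - 1e-12  # DPI
print(kl(p_x, q_x), kl(p_y, q_y))
```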
Conditioning increases divergence
Let \(P_X\) be a distribution on \(\mathcal{X}\), and let \(P_Y\) and \(Q_Y\) be the two distributions on \(\mathcal{Y}\) induced by the kernels \(P_{Y|X}\) and \(Q_{Y|X}\), respectively, both driven by the same input \(P_X\). Define the conditional \(f\)-divergence as \(D_f(P_{Y|X} \| Q_{Y|X} \mid P_X) := \mathbb{E}_{P_X} \big[ D_f(P_{Y|X} \| Q_{Y|X}) \big].\) Then the following inequality holds: \(D_f(P_{Y|X} \| Q_{Y|X} \mid P_X) \;\;\geq\;\; D_f(P_Y \| Q_Y).\)
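Similarly, the conditioning inequality can be checked numerically in the finite case (again with KL and arbitrary random kernels for this sketch): the \(P_X\)-average of the per-\(x\) divergence between the two kernels dominates the divergence between the induced marginals.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p_x = rng.dirichlet(np.ones(4))                # input distribution P_X
kernel_p = rng.dirichlet(np.ones(5), size=4)   # P_{Y|X}
kernel_q = rng.dirichlet(np.ones(5), size=4)   # Q_{Y|X}

p_y = p_x @ kernel_p                           # marginal induced by P_{Y|X}
q_y = p_x @ kernel_q                           # marginal induced by Q_{Y|X}

# Conditional divergence: E_{X ~ P_X}[ KL(P_{Y|X=x} || Q_{Y|X=x}) ]
conditional = float(np.sum(p_x * np.array([kl(kernel_p[x], kernel_q[x]) for x in range(4)])))

assert conditional >= kl(p_y, q_y) - 1e-12
print(conditional, kl(p_y, q_y))
```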
In fact, any divergence functional that is nonnegative, satisfies the DPI, and depends on \(P\) and \(Q\) only through the Radon-Nikodym derivative \(\frac{dP}{dQ}\) must be an \(f\)-divergence.
Definition 3 (Rényi divergence)
Let \(\alpha\in (0,1)\cup (1, \infty)\) be an adjustable order parameter. The Rényi divergence of order \(\alpha\) is defined as
\[D_\alpha(P \| Q) = \frac{1}{\alpha-1}\log \int_\mathcal{X} \left(\frac{dP}{dQ}\right)^\alpha dQ.\]
Rényi divergence for common values of \(\alpha\)
| α | Expression | Notes / Connection |
|---|---|---|
| 0 | \(D_0(P \Vert Q) = -\log Q(\operatorname{supp}(P))\) | Depends only on the support of \(P\); called min-divergence. |
| 1/2 | \(D_{1/2}(P \Vert Q) = -2 \log \int \sqrt{dP \, dQ}\) | Related to the Hellinger / Bhattacharyya coefficient; symmetric. |
| 1 | \(\lim_{\alpha \to 1} D_\alpha(P \Vert Q) = D_{\mathrm{KL}}(P \Vert Q) = \int \log\!\left(\tfrac{dP}{dQ}\right) dP\) | Standard KL divergence. |
| 2 | \(D_2(P \Vert Q) = \log \int \frac{(dP)^2}{dQ}\) | Related to the χ²-divergence: \(D_2 = \log\!\left(1+\chi^2(P \Vert Q)\right)\); sometimes called collision divergence. |
| ∞ | \(D_\infty(P \Vert Q) = \log \operatorname{ess\,sup}_x \frac{dP}{dQ}(x)\) | Maximum (essential) ratio of densities; called max-divergence. |
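A small numerical sketch of the special cases in the table, for finite, strictly positive distributions (the example distributions and tolerances are arbitrary): \(\alpha \to 1\) recovers KL, \(\alpha = \tfrac{1}{2}\) gives twice the Bhattacharyya distance, \(\alpha = 2\) equals \(\log(1 + \chi^2)\), and large \(\alpha\) approaches the maximum log-ratio.

```python
import numpy as np

def renyi(p, q, alpha):
    """Discrete Rényi divergence D_alpha(P || Q), alpha != 1, strictly positive p and q."""
    return float(np.log(np.sum(q * (p / q) ** alpha)) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

kl = float(np.sum(p * np.log(p / q)))
chi2 = float(np.sum((p - q) ** 2 / q))

assert np.isclose(renyi(p, q, 1.0001), kl, atol=1e-3)                      # alpha -> 1: KL
assert np.isclose(renyi(p, q, 0.5), -2 * np.log(np.sum(np.sqrt(p * q))))   # alpha = 1/2
assert np.isclose(renyi(p, q, 2.0), np.log(1.0 + chi2))                    # alpha = 2: collision
assert np.isclose(renyi(p, q, 200.0), np.max(np.log(p / q)), atol=1e-2)    # alpha -> infinity: max
```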
