Chernoff and Cramer

When working with statistical data, it is often desirable to be able to quantify the probability of certain undesirable events taking place. In this post we discuss an interesting connection between convex optimization and extreme event analysis. We start with the classical Chernoff bound for the empirical average.

Chernoff’s Bound

Let Y_i be data samples in \Re^d drawn independently and identically from a distribution \mb P_0 with mean \mu_0, and denote by \hat \mu \defn \frac{1}{n}\sum_{i=1}^n Y_i their empirical average. An important class of undesirable events can be expressed as the empirical average realizing in a convex set C. When, for instance, the Y_i are interpreted as losses, knowing the probability of the average loss exceeding a critical value t is paramount. In that case, the probability that the empirical average realizes in the half space \set{y}{y \geq t} would be of great interest. Chernoff’s classical inequality quantifies the probability of such events quite nicely.

Chernoff’s Theorem : The probability that the empirical average \hat \mu of the samples Y_i realizes in a closed convex set C satisfies

(1)   \begin{equation*}   \log \mb P_0^n \left(\hat \mu \in C \right) \leq -n \cdot \textstyle\inf_{\mu \in C}\Lambda^\star(\mu-\mu_0) \end{equation*}

with \Lambda^\star(\mu) \defn \sup_{\lambda\in\Re^d} \, \lambda\tpose \mu - \Lambda(\lambda) the convex dual of the log moment generating function \Lambda(\lambda)\defn \log \mb E_{\mb P}[e^{\lambda\tpose Y}] and \mb P the distribution of Y_i-\mu_0.
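These definitions can be made concrete numerically. The sketch below (an illustration added here, not part of the original post) takes the one-dimensional standard normal case, where \Lambda(\lambda) = \lambda^2/2, and checks that evaluating the supremum defining \Lambda^\star(\mu) recovers the known dual \mu^2/2:

```python
# Illustrative sketch: for a one-dimensional standard normal,
# Lambda(lam) = lam**2 / 2, and the convex dual
# Lambda*(mu) = sup_lam (lam * mu - Lambda(lam)) works out to mu**2 / 2.
from scipy.optimize import minimize_scalar

def Lambda(lam):
    # log moment generating function of a standard normal
    return 0.5 * lam ** 2

def Lambda_star(mu):
    # evaluate the supremum defining the convex dual numerically
    res = minimize_scalar(lambda lam: Lambda(lam) - lam * mu)
    return -res.fun

for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, Lambda_star(mu))  # matches mu**2 / 2
```

The maximizing \lambda here is \lambda = \mu, which recovers the well-known conjugate pair for the Gaussian.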

Proof: Let \lambda\in\Re^d and t\in\Re, and consider the positive function f(\mu)\defn e^{n\lambda\tpose \mu + t}. If the function f satisfies f(\mu)\geq 1 for all \mu in C, then f dominates the indicator function of the event \hat\mu \in C, and we may conclude that

    \[   \mb P_0^n(\hat \mu \in C) \leq  \E{\mb P_0^n}{f(\textstyle\frac1n \sum_{i=1}^n Y_i)}. \]

Using the independence of distinct samples Y_i and taking the logarithm on both sides of the previous inequality establishes

    \begin{align*}    \log\mb P_0^n(\hat \mu \in C) \leq & \, t + n \cdot \log \E{\mb P_0}{e^{\lambda\tpose Y}}\\    = &  \, t + n \cdot \lambda\tpose \mu_0 + n \cdot \Lambda(\lambda), \end{align*}

where the equality uses that \Lambda is the log moment generating function of the recentered samples Y - \mu_0.

It is clear that f(\mu)\geq 1 for all \mu in C if and only if -n \lambda\tpose \mu\leq t for all \mu in C. Choosing the smallest such t, namely t = -n \inf_{\mu\in C} \lambda\tpose \mu, and subsequently optimizing over \lambda, we obtain the general form of Chernoff’s bound

    \begin{align*}   \log\mb P_0^n(\hat \mu \in C) \leq & - n \cdot {\sup_{\lambda \in \Re^d} }\inf_{\mu \in C} \, \left(\lambda\tpose (\mu-\mu_0) - \Lambda(\lambda)\right), \\   \leq & -n \cdot \textstyle \inf_{\mu \in C} \Lambda^\star(\mu-\mu_0). \end{align*}

The last inequality follows from the minimax theorem for convex optimization, which applies here because the set C is convex.

Geometric Interpretation

The Chernoff bound (1) expresses the probability of the extreme event \hat \mu \in C in terms of the convex conjugate of the log moment generating function. Consequently, Chernoff’s bound can be computed by solving a convex optimization problem.
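As a sketch of such a computation (an added illustration with hypothetical problem data, not from the original post), consider a two-dimensional standard normal, for which \Lambda^\star(\mu) = \frac{1}{2}\mu\tpose\mu, and the half space C = \set{\mu}{a\tpose \mu \geq t}. The exponent \inf_{\mu\in C}\Lambda^\star(\mu-\mu_0) is then a small constrained convex program:

```python
# Hedged sketch: compute the Chernoff exponent inf_{mu in C} Lambda*(mu - mu0)
# for a standard normal in R^2 (Lambda*(mu) = ||mu||^2 / 2) and the half space
# C = {mu : a^T mu >= t}. The data a, t, mu0 below are illustrative choices.
import numpy as np
from scipy.optimize import minimize

mu0 = np.array([0.0, 0.0])
a, t = np.array([1.0, 1.0]), 2.0

def rate(mu):
    # Lambda* for a standard normal, shifted by the true mean mu0
    d = mu - mu0
    return 0.5 * d @ d

# scipy "ineq" constraints require fun(mu) >= 0, i.e. membership in C
cons = {"type": "ineq", "fun": lambda mu: a @ mu - t}
res = minimize(rate, x0=np.array([3.0, 0.0]), constraints=cons)
# closed form for this instance: (t - a @ mu0)**2 / (2 * a @ a) = 1
print(res.x, res.fun)
```

For a Gaussian this recovers the squared Euclidean distance from \mu_0 to the half space, scaled by one half, in line with the geometric picture below.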

The function \Lambda(\lambda) = \log \mb E_{\mb P}[e^{\lambda\tpose Y}] is the log moment generating function (also called the cumulant generating function) of the recentered distribution \mb P of Y - \mu_0 and is always convex. We give this function for some common standardized (zero mean, unit variance) distributions in the table below.

\mb P | \Lambda(\lambda) | \dom \Lambda
Normal | \frac{1}{2}\lambda\tpose \lambda | \Re^d
Laplace | \log \left(\frac{2}{2-\lambda\tpose\lambda}\right) | \set{\lambda}{\lambda\tpose\lambda < 2}

Furthermore, it comes with a nice geometric interpretation as well. The function \Lambda^\star(\mu - \mu_0) is nonnegative, convex, and vanishes at \mu = \mu_0, and thus acts as a distance-like measure between the mean \mu_0 of the distribution generating the data and a candidate empirical mean \mu.


The minimum distance r, as measured by the convex dual \Lambda^\star of the log moment generating function, between the mean \mu_0 of the distribution generating the data and the set C bounds the probability of the event \hat \mu \in C taking place.

Cramer’s Theorem

Chernoff’s bound is furthermore exponentially tight: it correctly identifies the exact exponential rate at which the probability of the event \hat{\mu}\in C diminishes to zero. This surprising fact is codified in Cramer’s theorem.

Cramer’s Theorem : Assume that the distribution \mb P_0 satisfies 0 \in \mathrm{int}\, \dom \Lambda. Then the probability that the empirical average of the Y_i realizes in an open set C diminishes with exponential rate

(2)   \begin{equation*}   \liminf_{n\to\infty}\frac{1}{n} \log \mb P_0^n \left( \hat{\mu} \in C \right) \geq - \textstyle\inf_{\mu \in C} \, \Lambda^\star(\mu- \mu_0). \end{equation*}

The lower bound in Cramer’s theorem shows that the Chernoff inequality accurately quantifies the probability of extreme events taking place as the number of samples n tends toward infinity. Notice that Cramer’s theorem does not require the set C to be convex. Note also that 0 \in \dom \Lambda always holds, as \Lambda(0) = 0 for any probability distribution \mb P; the stronger condition 0 \in \mathrm{int}\, \dom \Lambda requires the distribution \mb P to be light-tailed, that is, the tails of \mb P must diminish at an exponential rate.
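The matching of the two rates can be seen concretely in the Gaussian case (a sketch added for illustration): for standard normal samples, \hat\mu is distributed as N(0, 1/n), so the event probability \mb P_0^n(\hat\mu \geq t) is available in closed form, and the empirical rate -\frac{1}{n}\log \mb P_0^n(\hat\mu \geq t) should approach \Lambda^\star(t) = t^2/2 as n grows:

```python
# Sketch: for standard normal samples, hat mu ~ N(0, 1/n), so
# P(hat mu >= t) = P(Z >= t * sqrt(n)) for a standard normal Z.
# The rate -(1/n) log P should approach Lambda*(t) = t**2 / 2.
import numpy as np
from scipy.stats import norm

t = 1.0
for n in (10, 100, 10000):
    log_p = norm.logsf(t * np.sqrt(n))  # log P(hat mu >= t), exact
    print(n, -log_p / n)  # approaches t**2 / 2 = 0.5
```

Using `logsf` rather than `sf` avoids floating-point underflow, since the probabilities involved decay exponentially in n.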


