Nadaraya-Watson Kernel Regression

Given two real random variables X and Y with joint distribution \mb Q, we denote the regression function by

(1)   \begin{equation*}    m(x) \defn \E{\mb Q}{Y | X=x}. \end{equation*}

The aim is to estimate the regression function m from n data samples (X_i, Y_i) drawn independently and identically from the data distribution \mb Q.

Among the most popular nonparametric regression approaches are the Nadaraya-Watson kernel regressors. This family of regressors was introduced by Nadaraya and Watson and traces back to the regressogram from Tukey. The Nadaraya-Watson kernel regressor is defined by

(2)   \begin{equation*}   \hat m_{\mathrm{nw}} (x) \defn \frac{\sum_{i=1}^n Y_i \cdot K((x-X_i)/h_n)}{\sum_{i=1}^n K((x-X_i)/h_n)}. \end{equation*}

The kernel function K is a bounded and integrable real-valued function such that

    \[   \lim_{\norm{x}\to\infty} \norm{x} K(x) = 0. \]

The previous condition ensures that the kernel estimate (2) is local. Indeed, the weight K((x-X_i)/h_n) assigned to observation i decays at least inversely proportionally to its distance from the point of interest x. The constant h_n is commonly referred to as the bandwidth. An appropriate choice of the bandwidth parameter is crucial in practice.
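To make the definition concrete, a minimal implementation of the estimator (2) might look as follows. This is only a sketch under the assumption of a scalar covariate and a Gaussian kernel; the function names are our own and not taken from any particular library.

    import numpy as np

    def nadaraya_watson(x, X, Y, h, kernel):
        """Nadaraya-Watson estimate (2) of m at the point x.

        X, Y   : arrays holding the n observed samples (X_i, Y_i)
        h      : bandwidth h_n > 0
        kernel : kernel function K
        """
        weights = kernel((x - X) / h)            # K((x - X_i)/h_n)
        return np.sum(weights * Y) / np.sum(weights)

    def gaussian(u):
        # Gaussian kernel; with compactly supported kernels the denominator
        # above can vanish when no X_i lies within distance h of x.
        return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

    # toy usage on synthetic data with m(x) = sin(x)
    X = np.linspace(0.0, 3.0, 200)
    Y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=200)
    print(nadaraya_watson(1.5, X, Y, h=0.2, kernel=gaussian))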

Popular Kernels

To make the Nadaraya-Watson estimator (2) more concrete, we list here four kernels which are popular in practice.

Name            Kernel function K(x)
Naive           K(x) \defn \frac12 \one_{-1\leq x\leq 1}
Tri-Cubic       K(x) \defn \frac{70}{81} \max((1-\norm{x}^3)^3, 0)
Gaussian        K(x) \defn \frac{1}{\sqrt{2 \pi}} e^{-\norm{x}^2/2}
Epanechnikov    K(x) \defn \frac{3}{4} \max(1-x^2, 0)

It should be remarked that all four stated kernels are nonnegative. Although kernels which take negative values exist, they are not as easily interpretable. To better appreciate the differences between the four kernels, we visualize them in the figure below.

[Figure: the naive, tri-cubic, Gaussian and Epanechnikov kernels.]

Note that the naive kernel is discontinuous, which implies that the regression estimate \hat m_{\mathrm{nw}} is discontinuous as well. Discontinuous regression estimates may not be desirable in practice. The Gaussian kernel and all its derivatives are continuous functions, which yields a smooth estimate \hat m_{\mathrm{nw}}. As the Gaussian kernel has unbounded support, however, the estimate \hat m_{\mathrm{nw}}(x) depends at every point x on all observed samples (X_i, Y_i).
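For completeness, the four kernels of the table can be coded as follows (the naming is our own); each can be passed as the kernel argument of the snippet above.

    import numpy as np

    def naive(x):
        # K(x) = 1/2 on [-1, 1] and zero elsewhere
        return 0.5 * (np.abs(x) <= 1.0)

    def tri_cubic(x):
        # K(x) = 70/81 * max((1 - |x|^3)^3, 0)
        return (70.0 / 81.0) * np.maximum((1.0 - np.abs(x)**3)**3, 0.0)

    def gaussian(x):
        # K(x) = exp(-x^2/2) / sqrt(2*pi)
        return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    def epanechnikov(x):
        # K(x) = 3/4 * max(1 - x^2, 0)
        return 0.75 * np.maximum(1.0 - x**2, 0.0)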

Pointwise Bias Properties

It is assumed that the data distribution is well behaved. That is, \mb Q is absolutely continuous with respect to the Lebesgue measure on \Re^2 with continuous density function f_{X, Y}. The marginal density of the distribution of X is denoted by f_X. It should be clear that the regression function m is only properly defined on the support of f_X, i.e. \dom m = \set{x}{f_X(x)>0}. The marginal density f_Y of the distribution of Y is assumed to have bounded support.

  • A.1 There exists a constant M such that \norm{Y} \leq M <\infty.

Due to the local nature of the kernel estimate (2), the continuity of the function m to be estimated plays a prominent role in any statement of asymptotic properties. More precisely, we need the assumption that

  • A.2 the functions m and f_X are continuous at x and f_X(x) \neq 0.

A condition on the bandwidth parameter must be introduced as well. The bandwidth h_n should shrink to zero as more data becomes available, yet slowly enough that n h_n still tends to infinity. To that end we introduce the following condition

  • A.3 \lim_{n\to\infty} h_n = 0 and \lim_{n\to\infty} n h_n = \infty.
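For instance, the common choice h_n = c \cdot n^{-1/5} with c > 0 satisfies A.3, since h_n \to 0 while n h_n = c \cdot n^{4/5} \to \infty.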

Asymptotic bias: Suppose that assumptions A.1–A.3 hold. Then the kernel estimator \hat m_{\mathrm{nw}} is asymptotically unbiased at point x:

    \[   \lim_{n\to\infty} \E{}{\hat m_{\mathrm{nw}}(x)} = m(x). \]
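The statement can be checked informally by simulation. The sketch below is our own toy setup, not part of the stated result: it takes m(x) = \sin(x), standard normal covariates, bounded noise (so A.1 and A.2 hold), the bandwidth h_n = n^{-1/5} (so A.3 holds), and re-implements the estimator (2) with a Gaussian kernel.

    import numpy as np

    rng = np.random.default_rng(0)

    def m_hat(x, X, Y, h):
        # Nadaraya-Watson estimate (2) with a Gaussian kernel;
        # the normalizing constant of the kernel cancels in the ratio.
        w = np.exp(-0.5 * ((x - X) / h) ** 2)
        return np.sum(w * Y) / np.sum(w)

    def bias_at(x0, n, reps=500):
        # Monte-Carlo estimate of E[m_hat(x0)] - m(x0) for sample size n
        h = n ** (-0.2)                                     # bandwidth satisfying A.3
        estimates = []
        for _ in range(reps):
            X = rng.normal(size=n)                          # f_X > 0 at x0 and m continuous (A.2)
            Y = np.sin(X) + rng.uniform(-0.1, 0.1, size=n)  # |Y| <= 1.1, so A.1 holds
            estimates.append(m_hat(x0, X, Y, h))
        return np.mean(estimates) - np.sin(x0)

    for n in (100, 1000, 10000):
        print(n, bias_at(0.5, n))    # the estimated bias should shrink as n grows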

References

  1. P. Sarda and P. Vieu, “Kernel Regression”, in Smoothing and Regression: Approaches, Computation, and Application, John Wiley & Sons (2000).

Csiszár and Sanov

It turns out that the Chernoff bound previously discussed has a neat generalization in large deviation theory. In this setting the object under study is the entire empirical distribution \hat{\mb P}\defn \frac{1}{n} \sum_{i=1}^n \delta_{Y_i} instead of merely the empirical average of the data. Now we can study extreme events in which the empirical distribution realizes in a convex set \mc C.

Csiszár’s Theorem

Let Y_i be data samples in \Re^d drawn independently and identically from the distribution \mb P_0 with mean \mu_0. An important class of undesirable events can be expressed as the empirical distribution of the data Y_i realizing in a closed convex set \mc C. The closed convex set \mc C is a subset of the probability simplex \mc P_d of distributions on \Re^d. For the topology employed here and further technical conditions on the set \mc C, the reader is referred to the seminal paper of Csiszár. For the sake of readability, these technical points are suppressed here.

Csiszár’s Theorem : The probability that the empirical distribution of the Y_i realizes in a closed convex set \mc C is bounded by

(1)   \begin{equation*}   \log \mb P_0^n \left( \hat{\mb P} \in \mc C \right) \leq -n \cdot \textstyle\inf_{\mb P\in \mc C} \, D(\mb P, \mb P_0) \end{equation*}

where D(\mb P, \mb P_0) is the Kullback-Leibler divergence between \mb P and \mb P_0.

Geometric Interpretation

Csiszár’s theorem can be seen to subsume Chernoff’s result by recognizing that the event of the empirical average \frac1n \sum_{i=1}^n Y_i realizing in a closed convex set C \subset \Re^d is stated equivalently as the empirical distribution \hat{\mb P} realizing in the closed convex set \set{\mb P}{\int y \, \mb P(\d y) \in C}. In fact, both bounds are exactly the same, as it can be shown that

    \[   \inf_{\mu\in C} \Lambda^\star(\mu-\mu_0) = \inf_{\mb P} \{D(\mb P, \mb P_0) : \textstyle\int y \, \mb P(\d y) \in C\}, \]

where \Lambda^\star is the convex dual of the log moment generating function \Lambda(\lambda) \defn \log \mb E_{\mb P}[e^{\lambda\tpose Y}] of the distribution \mb P of Y_i - \mu_0. Observe that Csiszár’s inequality (1) admits a nice geometrical interpretation, as illustrated in the figure below. The probability of the empirical distribution realizing in a set \mc C is bounded in terms of the distance, as measured by the Kullback-Leibler divergence, between the distribution \mb P_0 generating the data and the set of extreme events \mc C.
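For instance, when \mb P_0 = \mc N(0, I) one has \Lambda(\lambda) = \frac12 \lambda\tpose \lambda and hence \Lambda^\star(\mu) = \frac12 \norm{\mu}^2, while among all distributions with mean \mu the one closest to \mb P_0 in Kullback-Leibler divergence is \mc N(\mu, I) with

    \[   D(\mc N(\mu, I), \mc N(0, I)) = \tfrac{1}{2} \norm{\mu}^2. \]

Minimizing over all means \mu admissible in C, both sides of the previous identity therefore coincide.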

As the Kullback-Leibler divergence is jointly convex in both its arguments, Csiszár’s bound (1) is stated in terms of a convex optimization problem; that is, a convex optimization problem over distributions on \Re^d rather than over vectors in \Re^d.
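To make this concrete, the infimum in (1) can be computed numerically when the distributions live on a finite support. The sketch below is our own illustration, not taken from the references: it takes \mb P_0 to be a fair six-sided die and \mc C the set of distributions with mean at least 4.5, and solves the resulting convex program with scipy.

    import numpy as np
    from scipy.optimize import minimize

    y = np.arange(1, 7)               # support points 1, ..., 6
    p0 = np.full(6, 1.0 / 6.0)        # fair die: the data-generating distribution P_0
    t = 4.5                           # extreme event: the empirical mean is at least t

    def kl(p):
        # Kullback-Leibler divergence D(p, p0) on the finite support
        p = np.clip(p, 1e-12, None)   # guard against log(0); 0 log 0 = 0 by convention
        return np.sum(p * np.log(p / p0))

    constraints = [
        {"type": "eq",   "fun": lambda p: np.sum(p) - 1.0},  # p is a probability vector
        {"type": "ineq", "fun": lambda p: y @ p - t},        # mean constraint defining C
    ]
    result = minimize(kl, p0, bounds=[(0.0, 1.0)] * 6,
                      constraints=constraints, method="SLSQP")

    print("I-projection of P_0 onto C:", np.round(result.x, 4))
    print("exponential rate inf_{P in C} D(P, P_0):", result.fun)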

[Figure: the Kullback-Leibler distance between the data-generating distribution \mb P_0 and the set of extreme events \mc C.]

Sanov’s Theorem

Csiszár’s bound is furthermore exponentially tight, as it correctly identifies the exponential rate \inf_{\mb P} \set{D(\mb P, \mb P_0)}{\mb P \in \mc C} with which the probability of the event \hat{\mb P}\in \mc C diminishes to zero. This fact is codified in Sanov’s theorem.

Sanov’s Theorem : The probability that the empirical distribution of the Y_i realizes in an open convex set \mc C diminishes with exponential rate

(2)   \begin{equation*}   \frac{1}{n} \log \mb P_0^n \left( \hat{\mb P} \in \mc C \right) \to - \textstyle\inf_{\mb P \in \mc C} \, D(\mb P, \mb P_0). \end{equation*}

The previous result shows that the Chernoff and Csiszár inequalities accurately quantify the probability of extreme events taking place. Furthermore, due to the convexity of the objects involved, computing these bounds amounts to solving a convex optimization problem.

References

  1. Csiszár, I. “Sanov property, generalized I-projection and a conditional limit theorem.” The Annals of Probability (1984): 768-793.

Chernoff and Cramer

When working with statistical data, it is often desirable to be able to quantify the probability of certain undesirable events taking place. In this post we discuss an interesting connection between convex optimization and extreme event analysis. We start with the classical Chernoff bound for the empirical average.

Chernoff’s Bound

Let Y_i be data samples in \Re^d drawn independently and identically from the distribution \mb P_0 with mean \mu_0. An important class of undesirable events can be expressed as the empirical average of the data Y_i realizing in a convex set C. When, for instance, the Y_i are scalar losses, knowing the probability of the average loss exceeding a critical value t is paramount; that is, the probability that the empirical average \frac1n \sum_{i=1}^n Y_i realizes in the half line \set{y}{y \geq t}. Chernoff’s classical inequality quantifies the probability of such events quite nicely.

Chernoff’s Theorem : The probability that the empirical average \hat \mu of the Y_i realizes in a closed convex set C satisfies

(1)   \begin{equation*}   \log \mb P_0^n \left(\hat \mu \in C \right) \leq -n \cdot \textstyle\inf_{\mu \in C}\Lambda^\star(\mu-\mu_0) \end{equation*}

with \Lambda^\star(\mu) \defn \sup_{\lambda\in\Re^d} \, \lambda\tpose \mu - \Lambda(\lambda) the convex dual of the log moment generating function \Lambda(\lambda)\defn \log \mb E_{\mb P}[e^{\lambda\tpose Y}] and \mb P the distribution of Y_i-\mu_0.

Proof: Let \lambda\in\Re^d and t\in\Re, and consider the positive function f(\mu)\defn e^{n\lambda\tpose \mu + t}. If the function f satisfies f(\mu)\geq 1 for all \mu in C, then we may conclude that

    \[   \mb P_0^n(\hat \mu \in C) \leq  \E{\mb P_0^n}{f(\textstyle\frac1n \sum_{i=1}^n Y_i)}. \]

Using the independence of distinct samples Y_i and taking the logarithm on both sides of the previous inequality establishes

    \begin{align*}    \log\mb P_0^n(\hat \mu \in C) \leq & \, t + n \cdot \log \E{\mb P_0}{e^{\lambda\tpose Y}}\\    = &  \, t + n \cdot \lambda\tpose \mu_0 + n \cdot \Lambda(\lambda). \end{align*}

It is clear that f(\mu)\geq 1 for all \mu in C if and only if -n \lambda\tpose \mu\leq t for all \mu in C. From this we obtain the general form of Chernoff’s bound

    \begin{align*}   \log\mb P_0^n(\hat \mu \in C) \leq & - n \cdot \sup_{\lambda \in \Re^d} \inf_{\mu \in C} \, \left(\lambda\tpose (\mu-\mu_0) - \Lambda(\lambda)\right), \\   \leq & -n \cdot \textstyle \inf_{\mu \in C} \Lambda^\star(\mu-\mu_0). \end{align*}

The last inequality follows from the minimax theorem for convex optimization.

Geometric Interpretation

The Chernoff bound (1) expresses the probability of the extreme event \frac1n \sum_{i=1}^n Y_i \in C in terms of the convex conjugate of the log moment generating function. This means that Chernoff’s bound can be computed by solving a convex optimization problem.
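For instance, for a scalar standard normal distribution (\mu_0 = 0 and \Lambda(\lambda) = \lambda^2/2) this optimization can be carried out in closed form: \Lambda^\star(\mu) = \sup_{\lambda} \, \lambda \mu - \lambda^2/2 = \mu^2/2, attained at \lambda = \mu, so that for the half line C = \set{y}{y \geq t} with t \geq 0 the bound (1) reduces to \log \mb P_0^n(\hat\mu \geq t) \leq -n t^2/2.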

The function \Lambda(\lambda) = \log \mb E_{\mb P}[e^{\lambda\tpose Y}] is the log moment generating function of the recentered distribution \mb P of Y_i - \mu_0 and is always convex. We give the cumulant generating function for some common standardized (zero mean, unit variance) distributions in the table below.

\mb P       \Lambda(\lambda)                                        \dom \Lambda
Normal      \frac{1}{2}\lambda\tpose \lambda                        \Re^d
Laplace     \log \left(\frac{2}{2-\lambda\tpose\lambda}\right)      \set{\lambda}{\lambda\tpose\lambda < 2}
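For the Laplace entry of the table the dual \Lambda^\star has no equally clean expression, but it is readily evaluated numerically, in line with the convex optimization viewpoint above. The following sketch is our own illustration; it computes the resulting Chernoff bound for the scalar event \hat\mu \geq t.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def Lambda(lam):
        # log moment generating function of the standardized Laplace distribution
        # (see the table above); its domain is lam**2 < 2
        return np.log(2.0 / (2.0 - lam**2))

    def Lambda_star(mu):
        # convex dual Lambda*(mu) = sup_lam (lam * mu - Lambda(lam)),
        # computed by a one-dimensional convex optimization over dom Lambda
        eps = 1e-9
        res = minimize_scalar(lambda lam: -(lam * mu - Lambda(lam)),
                              bounds=(-np.sqrt(2.0) + eps, np.sqrt(2.0) - eps),
                              method="bounded")
        return -res.fun

    t, n = 1.0, 100         # extreme event: the empirical average of n samples exceeds t
    rate = Lambda_star(t)   # inf over C = [t, inf) is attained at mu = t for t >= 0
    print("Chernoff bound: log P <=", -n * rate)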

Furthermore, the bound comes with a nice geometric interpretation. The function \Lambda^\star(\mu - \mu_0) is nonnegative and convex and thus acts as a distance between a candidate mean \mu and the mean \mu_0 of the distribution generating the data.

[Figure: the distance r, measured by \Lambda^\star, between the mean \mu_0 and the set C.]

The minimum distance r, as measured by the convex dual \Lambda^\star of the log moment generating function, between the mean \mu_0 of the distribution generating the data and the set C bounds the probability of the event \hat \mu \in C taking place.

Cramer’s Theorem

Chernoff’s bound is furthermore exponentially tight, as it correctly identifies the exact exponential rate with which the probability of the event \hat{\mu}\in C diminishes to zero. This last surprising fact is codified in Cramer’s theorem.

Cramer’s Theorem : Assume that the distribution \mb P_0 is such that 0 lies in the interior of \dom \Lambda. Then the probability that the empirical average of the Y_i realizes in an open set C diminishes with exponential rate

(2)   \begin{equation*}   \liminf_{n\to\infty}\frac{1}{n} \log \mb P_0^n \left( \hat{\mu} \in C \right) \geq - \textstyle\inf_{\mu \in C} \, \Lambda^\star(\mu- \mu_0). \end{equation*}

The lower bound in Cramer’s theorem ensures that the Chernoff inequality accurately quantifies the probability of extreme events taking place as the number of samples n tends to infinity. Notice that Cramer’s theorem does not require convexity of the set C. Note also that 0 \in \dom \Lambda always holds, since \Lambda(0) = 0 for any probability distribution; the stronger condition that 0 lies in the interior of \dom \Lambda requires the distribution \mb P to be light-tailed, in the sense that its tails must diminish at an exponential rate.
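As a simple sanity check in the scalar standard normal case: there \mb P_0^n(\hat\mu > t) = \Phi(-t\sqrt{n}) with \Phi the standard normal distribution function, and since \log \Phi(-s) = -s^2/2 + o(s^2) as s\to\infty, indeed \frac1n \log \mb P_0^n(\hat\mu > t) \to -t^2/2 = -\inf_{\mu > t}\Lambda^\star(\mu) for every t>0, so the exponential rate identified by Chernoff’s bound is attained.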

References

  1. A. Dembo, and O. Zeitouni. “Large Deviations Techniques and Applications”, Springer (2010).
  2. S. Boyd, and L. Vandenberghe. “Convex Optimization”, Cambridge University Press (2004).