Category Archives: Supervised Learning

Supervised learning is the machine learning task of inferring a function from a limited amount of labeled training data. The learning algorithm needs to generalize from the training data to unseen data as well as possible.

Nadaraya-Watson Kernel Regression

Given two real random variables X and Y with joint distribution \mb Q, we denote the regression function by

(1)   \begin{equation*}    m(x) \defn \E{\mb Q}{Y | X=x}. \end{equation*}

The aim is to estimate the regression function m from n data samples (X_i, Y_i) drawn independently and identically distributed from the data distribution \mb Q.

Among the most popular nonparametric regression approaches are the Nadaraya-Watson kernel regressors. This family of regressors was introduced by Nadaraya and Watson and traces back to the regressogram from Tukey. The Nadaraya-Watson kernel regressor is defined by

(2)   \begin{equation*}   \hat m_{\mathrm{nw}} (x) \defn \frac{\sum_{i=1}^n Y_i \cdot K((x-X_i)/h_n)}{\sum_{i=1}^n K((x-X_i)/h_n)}. \end{equation*}

The kernel function K is a bounded and integrable real-valued function such that

    \[   \lim_{\norm{x}\to\infty} \norm{x} K(x) = 0. \]

The previous condition ensures that the kernel estimate (2) is local. Indeed, the weight K((x-X_i)/h_n) assigned to observation i decreases at least inversely proportionally to its distance from the point of interest x. The constant h_n is commonly referred to as the bandwidth. An appropriate choice of the bandwidth parameter is crucial in practice.
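To make this concrete, here is a minimal sketch of the estimator (2) in Python with NumPy. The function name and interface are our own choices for illustration; the Gaussian kernel serves as a default.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h, kernel=None):
    """Nadaraya-Watson estimate of m(x) = E[Y | X = x].

    x      : query point(s) at which to estimate
    X, Y   : 1-D arrays holding the n training samples
    h      : bandwidth h_n > 0
    kernel : kernel function K; defaults to the Gaussian kernel
    """
    if kernel is None:
        kernel = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # W[i, j] = K((x_i - X_j) / h): weight of sample j at query point i
    W = kernel((x[:, None] - X[None, :]) / h)
    # weighted average of the Y_j, per query point
    return (W @ Y) / W.sum(axis=1)
```

Since the estimate is a weighted average of the Y_i, it reproduces a constant response exactly, which makes for a quick sanity check.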

Popular Kernels

To make the Nadaraya-Watson estimator (2) more concrete, we list here four kernels which are popular in practice.

Name          Kernel Function K
Naive         K(x) \defn \frac12 \one_{-1\leq x\leq 1}
Tri-Cubic     K(x) \defn \frac{70}{81} \max((1-\norm{x}^3)^3, 0)
Gaussian      K(x) \defn \frac{1}{\sqrt{2 \pi}} e^{-\norm{x}^2/2}
Epanechnikov  K(x) \defn \frac{3}{4} \max(1-x^2, 0)
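In code, the four kernels from the table might be written as follows (a sketch, vectorized over NumPy arrays; function names are ours):

```python
import numpy as np

# Each kernel takes the scaled argument u = (x - X_i) / h_n.

def naive(u):
    # indicator of [-1, 1], scaled to integrate to one
    return 0.5 * ((-1 <= u) & (u <= 1))

def tricubic(u):
    return (70 / 81) * np.maximum(1 - np.abs(u)**3, 0)**3

def gaussian(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    return 0.75 * np.maximum(1 - u**2, 0)
```

All four integrate to one over the real line, which can be checked numerically with a Riemann sum.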

It should be remarked that all four kernels listed above are nonnegative. Although kernels taking negative values exist, they are not as easily interpretable. For a better understanding of the differences between the four kernels, we have visualized them in the figure below.

[Figure: the naive, tri-cubic, Gaussian, and Epanechnikov kernels plotted on a common axis]

Note that the naive kernel is discontinuous, which implies that the regression estimate \hat m_{\mathrm{nw}} is discontinuous as well. Discontinuous regression functions may not be desirable in practice. The Gaussian kernel and all its derivatives are continuous, which yields a smooth regression estimate \hat m_{\mathrm{nw}}. As the Gaussian kernel has unbounded support, however, the estimate \hat m_{\mathrm{nw}}(x) depends at every point x on all observed samples (x_i, y_i).
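The last point can be made concrete with a small sketch (sample locations made up for illustration): with the naive kernel, samples farther than h_n from x receive exactly zero weight, whereas the Gaussian kernel assigns every sample a strictly positive weight.

```python
import numpy as np

X = np.array([0.0, 0.5, 3.0])    # hypothetical sample locations
x, h = 1.0, 1.0                  # query point and bandwidth
u = (x - X) / h                  # scaled distances: [1.0, 0.5, -2.0]

# naive kernel: the sample at 3.0 lies outside [x - h, x + h] and gets weight 0
naive_w = 0.5 * ((-1 <= u) & (u <= 1))
# Gaussian kernel: every sample receives a strictly positive weight
gauss_w = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
```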

Pointwise Bias Properties

It is assumed that the data distribution is well behaved. That is, \mb Q is absolutely continuous with respect to the Lebesgue measure on \Re^2 with continuous density function f_{X, Y}. The marginal density of the distribution of X is denoted by f_X. It should be clear that the regression function m is only properly defined on the support of f_X, i.e. \dom m = \set{x}{f_X(x)>0}. The marginal density f_Y of the distribution of Y is assumed to have bounded support.

  • A.1 There exists a constant M such that \norm{Y} \leq M <\infty.

Due to the local nature of the kernel estimate (2), the continuity of the function m to be estimated plays a prominent role in any statement of asymptotic properties. More precisely, we need the assumption that

  • A.2 the functions m and f_X are continuous at x and f_X(x) \neq 0.

A condition on the bandwidth parameter must be introduced as well. The bandwidth h_n should shrink as more data becomes available, but more slowly than \frac{1}{n}. To that end, we introduce the following condition

  • A.3 \lim_{n\to\infty} h_n = 0 and \lim_{n\to\infty} n h_n = \infty.
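For example, the polynomial rate h_n = c \, n^{-1/5} satisfies A.3: h_n \to 0 while n h_n = c \, n^{4/5} \to \infty. (The exponent 1/5 is the usual mean-squared-error-optimal rate for twice-differentiable m; any exponent strictly between 0 and 1 would satisfy A.3. The helper below is a hypothetical illustration, not a prescription.)

```python
def bandwidth(n, c=1.0):
    # h_n = c * n**(-1/5): h_n -> 0 and n * h_n = c * n**(4/5) -> infinity,
    # so both limits in condition A.3 hold.
    return c * n ** (-1 / 5)
```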

Asymptotic bias: Suppose that assumptions A.1–A.3 hold. Then the kernel estimator \hat m_{\mathrm{nw}} is asymptotically unbiased at point x:

    \[   \lim_{n\to\infty} \E{}{\hat m_{\mathrm{nw}}(x)} = m(x). \]
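The statement can be illustrated numerically with a small simulation sketch (our own setup, with m(x) = \sin(x) and bounded uniform noise so that A.1 holds): average the estimate \hat m_{\mathrm{nw}}(x) over many replications and observe that the average approaches m(x) as n grows.

```python
import numpy as np

def mean_nw_estimate(n, reps=100, x0=0.5, seed=0):
    """Average the Gaussian-kernel NW estimate at x0 over `reps` data sets of size n."""
    rng = np.random.default_rng(seed)
    ests = []
    for _ in range(reps):
        X = rng.uniform(-2.0, 2.0, n)
        Y = np.sin(X) + rng.uniform(-0.1, 0.1, n)  # m(x) = sin(x); Y bounded (A.1)
        h = n ** (-1 / 5)                          # bandwidth satisfying A.3
        w = np.exp(-0.5 * ((x0 - X) / h) ** 2)     # Gaussian kernel weights
        ests.append(w @ Y / w.sum())
    return float(np.mean(ests))
```

With this setup the absolute bias |\E{}{\hat m_{\mathrm{nw}}(x_0)} - \sin(x_0)| shrinks noticeably when moving from n = 100 to n = 10\,000, in line with the asymptotic unbiasedness stated above.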

