How Dommel and Pichler Finally Cracked the Kernel Approximation Problem That Was Holding Machine Learning Back

Statistical Learning · Journal of Machine Learning Research 26 (2025) 1–30

How Two Researchers from Chemnitz Quietly Fixed One of the Oldest Problems in Kernel Machine Learning

Paul Dommel and Alois Pichler of TU Chemnitz developed a Taylor-series-based approach to approximating kernel functions in reproducing kernel Hilbert spaces. Their work yields what the authors describe as the first non-exponential eigenfunction bounds for the Gaussian kernel and justifies regularization parameters far smaller than anything the existing literature has considered safe.

There is a quiet assumption sitting beneath almost every kernel method in machine learning. It says that the regularization parameter may decrease no faster than the inverse of the sample size. Practitioners have accepted this for years because the theoretical tools to challenge it were missing. Paul Dommel and Alois Pichler have now published a paper in JMLR that, for the Gaussian kernel, removes this restriction and replaces it with a far more flexible condition.

What Kernel Methods Are Actually Doing and Why Approximation Matters

A kernel method starts with a choice. You pick a function that measures similarity between two data points and that function implicitly defines a space of possible predictors called a reproducing kernel Hilbert space. The brilliant insight behind these methods is that you never have to work directly in that space. You only ever need to evaluate the kernel function between observed data points and the points you want to predict. This is the kernel trick and it made support vector machines and Gaussian processes practical.

The trouble is that working with the full kernel gram matrix becomes computationally expensive as your dataset grows. A dataset with ten thousand points requires storing and inverting a ten thousand by ten thousand matrix. Low rank approximation methods like the Nystrom approach sidestep this by replacing the full kernel matrix with a much smaller approximation built from a carefully chosen subset of supporting points. But how many supporting points do you need and how do you know the approximation is still accurate? The answer to that question depends entirely on how well you can approximate the kernel function itself in the range of the associated integral operator. That is precisely the problem this paper solves.
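To make the low rank idea concrete, here is a minimal sketch of a Nystrom approximation of a Gaussian Gram matrix. It assumes uniformly random supporting points and a small jitter term for numerical stability; the helper names `gaussian_gram` and `nystrom_gram` and the jitter value are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gaussian kernel matrix exp(-sigma * ||x - y||^2) for rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sigma * np.maximum(sq, 0.0))

def nystrom_gram(X, idx, sigma=1.0, jitter=1e-8):
    """Nystrom approximation K_nm K_mm^{-1} K_mn built from the support rows X[idx]."""
    K_nm = gaussian_gram(X, X[idx], sigma)
    K_mm = gaussian_gram(X[idx], X[idx], sigma)
    # The jitter keeps the solve stable; its size is an arbitrary (assumed) choice.
    K_mm = K_mm + jitter * np.eye(len(idx))
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))          # synthetic inputs in [0, 1]
K = gaussian_gram(X, X)
errors = {}
for m in (2, 5, 10, 20):
    idx = rng.choice(len(X), size=m, replace=False)
    errors[m] = np.linalg.norm(K - nystrom_gram(X, idx)) / np.linalg.norm(K)
    print(f"m={m:3d}  relative Frobenius error = {errors[m]:.2e}")
```

On this toy problem a handful of supporting points already reproduces the full 500 by 500 matrix closely, which is the behavior the theory discussed below is meant to explain.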

The key object is what the authors call the point evaluation function. For any fixed data point x, the function k_x assigns to every other point y the value k(x, y). Understanding how well you can approximate k_x inside the image of the Hilbert-Schmidt integral operator is the same as understanding how many supporting points you need for a reliable low rank approximation. Before this paper, the best available bounds were either too loose to be useful in practice or required assumptions that no one had verified.

The Core Insight

The paper constructs an explicit weight function by matching the first m Taylor coefficients of the kernel function. This minimal moment function has a squared norm bounded by m squared regardless of where in the input domain you evaluate it. That single fact is the seed from which every other result in the paper grows including the polynomial eigenfunction bounds and the liberalized regularization theory.

The Minimal Moment Function and Why It Changes Everything

The central construction in the paper is what Dommel and Pichler call the minimal moment function. For a fixed point x in the unit interval they ask for the function of smallest L2 norm that satisfies a specific set of moment constraints. Those constraints say that the integral of z to the power l times the function w must equal x to the power l for each l from zero to m minus one.

The explicit solution turns out to be a polynomial of degree m minus one whose coefficients are determined by the Hilbert matrix. The Hilbert matrix is that notorious object whose entry in row i and column j equals one divided by i plus j minus one. It is famously ill conditioned for large m but that does not actually cause problems here because the paper only needs the existence and norm of the solution rather than numerical computation of the coefficients.

The formal result establishes that the squared L2 norm of this minimal moment function is bounded above by m squared for every point x in the unit interval. This is Theorem 2 in the paper and the proof is elegant. The authors construct an auxiliary function by stitching together rescaled copies of the minimal moment function at the boundary point x equals one and then show that this auxiliary function always has the same or larger norm. The norm at the boundary is computed exactly and equals m squared. Therefore the norm at every interior point is at most m squared.

Moment Constraint (Eq. 1.1) $$\int_0^1 z^\ell \, w_m^x(z) \, dz = x^\ell \quad \ell = 0, \ldots, m-1$$
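The moment constraints and the norm bound can be checked in closed form. If w is a polynomial with coefficient vector alpha, the constraints read H_m alpha = (1, x, ..., x^{m-1}) and the squared L2 norm equals alpha^T H_m alpha. The sketch below uses SciPy's `hilbert` and `invhilbert`; the function names `moment_coefficients` and `norm_squared` are ours, and the example is only meant for small m where floating point arithmetic is still reliable.

```python
import numpy as np
from scipy.linalg import hilbert, invhilbert

def moment_coefficients(x, m):
    """Coefficients alpha of w(z) = sum_i alpha_i z^i solving H_m alpha = (1, x, ..., x^{m-1})."""
    x_bar = x ** np.arange(m)
    return invhilbert(m) @ x_bar

def norm_squared(x, m):
    """Squared L2 norm of w: alpha^T H_m alpha = x_bar^T H_m^{-1} x_bar."""
    x_bar = x ** np.arange(m)
    return float(x_bar @ invhilbert(m) @ x_bar)

m = 5
alpha = moment_coefficients(0.3, m)
# Moment check: int_0^1 z^l w(z) dz = sum_i alpha_i / (l + i + 1) = (H_m alpha)_l
moments = hilbert(m) @ alpha
print("moments match:", np.allclose(moments, 0.3 ** np.arange(m)))
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x={x:.2f}  ||w||^2 = {norm_squared(x, m):8.4f}   m^2 = {m**2}")
```

The boundary values reproduce m squared because the top-left entry of the inverse Hilbert matrix and the sum of all its entries both equal m squared.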

The multivariate extension works by taking products. For a point x in the d dimensional unit cube you define the weight function as the product of the one dimensional minimal moment functions in each coordinate divided by the density of the underlying probability measure. This product function satisfies a multivariate generalization of the moment constraints and has a squared norm bounded by a constant times m to the power 2d. The extra factor of the inverse density accounts for the fact that the underlying measure may not be uniform.

The moment matching property is what makes the weight function useful for approximating kernels. A radial kernel like the Gaussian depends on the squared distance between two points. The Taylor series of that kernel in the squared distance converges everywhere. By choosing the weight function to match the first m Taylor coefficients you guarantee that the first m terms in the Taylor expansion of the approximation error vanish exactly. Only the tail of the Taylor series contributes to the error.
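A small numerical illustration of why only the tail matters: the sketch below truncates the Taylor series of exp(-sigma r) after m terms and compares the worst remainder on r in [0, d] with the tail sum of (sigma d)^l / l!. The range [0, d] reflects that squared distances in the unit cube are at most d. The function names and the truncation of the tail after 60 terms are illustrative assumptions, not the paper's construction.

```python
import math
import numpy as np

def taylor_partial_sum(r, sigma, m):
    """First m Taylor terms of exp(-sigma * r) around r = 0."""
    return sum((-sigma * r) ** ell / float(math.factorial(ell)) for ell in range(m))

def tail_bound(sigma, d, m, n_terms=60):
    """sum_{ell >= m} (sigma * d)^ell / ell!, truncated after n_terms terms."""
    return sum((sigma * d) ** ell / float(math.factorial(ell)) for ell in range(m, m + n_terms))

sigma, d = 1.0, 1
r = np.linspace(0.0, d, 201)   # squared distances in [0,1]^d lie in [0, d]
results = {}
for m in (2, 4, 8, 12):
    err = float(np.max(np.abs(np.exp(-sigma * r) - taylor_partial_sum(r, sigma, m))))
    results[m] = (err, tail_bound(sigma, d, m))
    print(f"m={m:2d}  max remainder = {results[m][0]:.3e}  tail bound = {results[m][1]:.3e}")
```

The factorial in the denominator is what drives the faster than exponential decay of the remainder.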

Approximating the Gaussian Kernel and Getting Explicit Bounds

The Gaussian kernel is the workhorse of practical kernel methods. It takes the form of the exponential of negative sigma times the squared distance between two points where sigma is a positive width parameter. The paper focuses most of its attention on this kernel because its Taylor coefficients are explicit and their decay is well understood.

Theorem 4 in the paper provides a uniform bound on the approximation error in the infinity norm. The bound shows that by choosing m proportional to 3 c_σ t + 2, where c_σ is the maximum of 1 and 2e times sigma times d, the approximation error is at most 3 times the product td raised to the power negative 3td. Here t is a parameter that exceeds a certain threshold depending on the dimension and the density bounds. The approximation error shrinks superexponentially as t grows.

Uniform Approximation Bound (Theorem 4) $$\sup_{x \in [0,1]^d} \left\| L_k W_m^x - k_x \right\|_\infty \leq 3(td)^{-3td}$$

The more important bound for applications is in the norm of the reproducing kernel Hilbert space itself. This is Proposition 2 and the result is that the squared RKHS norm of the approximation error is bounded by 9 times td raised to the power negative 2td. The proof works by combining the infinity norm bound with a technical inner product estimate and is one of the more involved arguments in the paper.

What makes both bounds remarkable is how rapidly they decay. To get the approximation error below any fixed epsilon you need m to grow only logarithmically in 1 over epsilon. The number of supporting points required by the Nystrom method scales accordingly which is a dramatic improvement over what existing theory predicted.

Why This Bound Matters for Low Rank Methods

Every low rank kernel method, including the Nystrom algorithm, kernel PCA, and divide and conquer kernel ridge regression, depends on how accurately you can approximate kernel functions in the image of the integral operator. Dommel and Pichler give the first explicit tight bound on this approximation quality for the Gaussian kernel. The consequence is a concrete and verifiable criterion for how many supporting points you actually need rather than an order of magnitude estimate that leaves you guessing.

The First Polynomial Bound on Gaussian Kernel Eigenfunctions

Here is the result that the authors themselves describe as the most novel contribution of the paper. Every kernel function has an associated Mercer decomposition which expresses the kernel as an infinite sum of products of eigenfunctions weighted by eigenvalues. These eigenfunctions are orthonormal in L2, and after rescaling each one by the square root of its eigenvalue they form an orthonormal basis for the reproducing kernel Hilbert space. How large can the lth eigenfunction get in the uniform norm?

Before this paper the standard answer came from the Cauchy Schwarz inequality applied to the reproducing property. That bound says the maximum of the absolute value of the lth eigenfunction is at most the square root of k(x,x) divided by the square root of the lth eigenvalue. For the Gaussian kernel where k(x,x) equals 1 and the eigenvalues decay exponentially in l, this bound grows exponentially in l. An exponentially growing bound is essentially useless for any application that needs to sum over many eigenfunctions.
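For readers who want the classical bound written out, here is a short derivation. It assumes the eigenfunctions are normalized in L2 with eigenvalues written as μ_ℓ (our notation), so that the RKHS norm of the ℓth eigenfunction is μ_ℓ to the power negative one half, and it uses the reproducing property followed by Cauchy Schwarz.

Classical Eigenfunction Bound (Cauchy Schwarz) $$|\varphi_\ell(x)| = |\langle \varphi_\ell, k_x \rangle_k| \leq \|\varphi_\ell\|_k \, \|k_x\|_k = \frac{\sqrt{k(x,x)}}{\sqrt{\mu_\ell}}$$

With k(x,x) = 1 the right hand side is just μ_ℓ to the power negative one half, so any exponential decay of the eigenvalues turns directly into exponential growth of the bound.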

Theorem 7 in the paper shows that the eigenfunctions of the d dimensional Gaussian kernel satisfy a polynomial bound. Specifically the maximum absolute value of the lth eigenfunction is at most a constant times l to the power 2. This is the bound stated in equation (3.14) of the paper and the authors note that to the best of their knowledge it is the first non-exponential bound of this kind ever established for the Gaussian kernel on a bounded domain.

Polynomial Eigenfunction Bound (Theorem 7 / Eq. 3.14) $$\max_{x \in [0,1]^d} |\varphi_\ell(x)| \leq b \, \ell^2, \quad \ell = 1, 2, \ldots$$

The proof strategy is indirect and genuinely clever. The key observation is that for the regularization parameter lambda equal to the lth eigenvalue, the squared L2 norm of the optimal weight function w_λ includes the term one quarter times the squared value of the lth eigenfunction at x. The bound on the norm of w_λ from Theorem 5 therefore gives a bound on the pointwise value of the lth eigenfunction. Combining with the eigenvalue decay bound from Lemma 1 which says eigenvalues decrease no faster than a stretched exponential in l to the power 2 over d, the authors trace through the algebra and arrive at the quadratic polynomial bound.

The eigenvalue decay bound itself requires a careful argument. In one dimension it uses a connection between the minimum gap between random uniform points and the smallest eigenvalue of the random gram matrix via results from Diederichs and Iske on condition numbers of radial basis function interpolation matrices. The multivariate case exploits the product structure of the Gaussian kernel to reduce to the one dimensional case.

Concentration Bounds That Allow Near Machine Precision Regularization

The most immediately practical consequence of the eigenfunction bounds is what they do to the theory of kernel ridge regression. The standard story goes like this. You have n data points. You form the kernel gram matrix and solve a regularized least squares problem. To analyze how well the solution generalizes to new data you need to show that the empirical operator built from your n samples is close to the population operator. This requires a concentration inequality.

All existing concentration inequalities for kernel operators had a constraint. They required the regularization parameter lambda to decrease no faster than 1 over n. This constraint was not arbitrary. It came from the best available bound on a quantity called N infinity of lambda which measures how hard it is to approximate kernel functions in the operator image. If N infinity of lambda grows too fast as lambda shrinks then the concentration inequality breaks down.

Dommel and Pichler show that for the Gaussian kernel N infinity of lambda grows only as the logarithm of 1 over lambda raised to the power 2d. This is dramatically slower than what previous bounds suggested. As a result the concentration inequality in Proposition 3 remains valid for regularization parameters that decay exponentially fast in n to the power 1 over 2d plus 1. In practice this means you can set lambda to something approaching machine precision and the theory still holds. That was simply not known before.

N Infinity Growth Rate (Section 4) $$N_\infty(\lambda) = O\!\left(\ln(\lambda^{-1})^{2d}\right) \quad \text{as } \lambda \to 0$$
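To get a feel for how slow this growth is, the sketch below tabulates ln(1/λ)^{2d} with all constants dropped next to 1/λ. Using 1/λ as the reference is our assumption about the generic worst case scaling of N infinity in earlier work, not a formula quoted from the paper, and the function name `log_growth` is ours.

```python
import numpy as np

def log_growth(lam, d):
    """Growth rate ln(1/lambda)^(2d) stated for the Gaussian kernel (constants dropped)."""
    return np.log(1.0 / lam) ** (2 * d)

d = 1
rows = []
for lam in (1e-2, 1e-4, 1e-8, 1e-16):
    rows.append((lam, log_growth(lam, d), 1.0 / lam))
    print(f"lambda={lam:.0e}  ln(1/lambda)^(2d) = {rows[-1][1]:10.1f}   1/lambda = {rows[-1][2]:.1e}")
```

Even at a regularization parameter near double precision round-off, the logarithmic rate stays in the thousands for d = 1 while the reference rate is astronomically large.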

The practical upshot is that low rank kernel methods no longer have to worry that their regularization parameter is too small. Existing theory forced practitioners to use a larger lambda than they might otherwise want because the concentration bounds required it. This paper removes that constraint and justifies the common practice of tuning lambda via cross validation down to very small values.

An Interpolation Inequality That Connects L2 Convergence to Uniform Convergence

One of the most elegant results in the paper is an interpolation inequality that lets you infer uniform convergence from L2 convergence for smooth functions. The two norms are not generally comparable. A sequence of functions can converge in L2 while having arbitrarily large pointwise values. But if the functions are smooth in the sense of having bounded norm in an appropriate Sobolev type space indexed by the eigenfunctions of the kernel then the L2 norm controls the infinity norm.

Theorem 8 makes this precise. For a function f expanded in the eigenfunction basis with coefficients c sub l, define the s norm as the square root of the sum of c sub l squared times l to the power 2s, so that larger s penalizes high-index coefficients more heavily. If this norm is finite and s is greater than 3 then the infinity norm of f is bounded by a constant times the s norm raised to the power 3 over s times the L2 norm raised to the power 1 minus 3 over s. The constant involves the eigenfunction bound b from equation (3.14) and Euler’s formula for the value of the Riemann zeta function at 2 which equals pi squared over 6.

Interpolation Inequality (Theorem 8) $$\|f\|_\infty \leq \frac{\pi}{\sqrt{6}} \, b \, \|f\|_s^{3/s} \cdot \|f\|_2^{1-3/s}, \quad s > 3$$
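The inequality can be sanity checked at the level of coefficient sequences. The sketch below assumes the s norm weights the squared coefficients by l^{2s}, replaces each eigenfunction by its worst case size b l^2, and compares the resulting sum with the right hand side. The random coefficients, the decay rate, and the choice b = 1 are hypothetical test data, not values from the paper.

```python
import numpy as np

def s_norm(c, s):
    """Weighted norm sqrt(sum_l c_l^2 l^(2s)); s = 0 gives the plain l2 norm."""
    ell = np.arange(1, len(c) + 1, dtype=float)
    return float(np.sqrt(np.sum(c**2 * ell ** (2 * s))))

def sup_bound_lhs(c, b=1.0):
    """Worst case of |sum_l c_l phi_l(x)| when |phi_l(x)| <= b * l^2."""
    ell = np.arange(1, len(c) + 1, dtype=float)
    return float(b * np.sum(np.abs(c) * ell**2))

def interpolation_rhs(c, s, b=1.0):
    """(pi / sqrt(6)) * b * ||c||_s^(3/s) * ||c||_0^(1 - 3/s)."""
    const = np.pi / np.sqrt(6.0) * b
    return float(const * s_norm(c, s) ** (3.0 / s) * s_norm(c, 0.0) ** (1.0 - 3.0 / s))

rng = np.random.default_rng(1)
s = 4.0
checks = []
for trial in range(5):
    ell = np.arange(1, 41, dtype=float)
    c = rng.normal(size=40) * ell ** (-6.0)   # hypothetical fast-decaying coefficients
    lhs, rhs = sup_bound_lhs(c), interpolation_rhs(c, s)
    checks.append(lhs <= rhs)
    print(f"trial {trial}: lhs = {lhs:.4e}  rhs = {rhs:.4e}")
```

Behind the check sit two standard steps: Cauchy Schwarz against the series of 1/l^2, which produces the factor pi over root 6, and Hölder's inequality, which splits the weight l^6 between the s norm and the L2 norm.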

This result has immediate consequences for regression. If you have a sequence of estimators converging in L2 at a known rate and if those estimators have uniformly bounded s norm then the interpolation inequality gives you a free upgrade to uniform convergence. The rate in L2 transfers to a rate in the infinity norm, with the exponent reduced by the factor 1 minus 3 over s. Results of this type have appeared in the literature before but always required stronger smoothness conditions than the ones used here. The key enabling factor is the polynomial eigenfunction bound which makes the Riemann zeta series converge.

Sharpening the Nystrom Method with Explicit Supporting Point Counts

The Nystrom method builds a low rank approximation to the kernel Gram matrix by selecting a small number of supporting points and computing the approximation from the kernel evaluations at those points. The fundamental question is how many supporting points m you need to get a good approximation. Rudi, Camoriano, and Rosasco established in their 2015 NIPS paper that the Nystrom method needs at most order N infinity of lambda times log lambda inverse supporting points. That paper left N infinity of lambda as an abstract function.

Theorem 9 in the paper of Dommel and Pichler plugs in the explicit bound from Theorem 5 to turn that abstract statement into a concrete number. The result says that the Nystrom approximation achieves near optimal regression error whenever the number of supporting points m exceeds a threshold involving 9 times c_p times the quantity 3 c_σ s(λ) plus 2, all raised to the power 2d, times the log of 1 over lambda.

Nystrom Supporting Point Threshold (Theorem 9) $$m \geq \Bigl(67 \vee 5\sqrt{9c_p(3c_\sigma s(\lambda)+2)^{2d}+1}\Bigr) \ln \lambda^{-1}$$

Since s of lambda grows only logarithmically in 1 over lambda the entire threshold grows only as log lambda inverse raised to the power 2d plus 1 times log log lambda inverse. For any fixed dimension d this is a slowly growing function. It tells you that you need far fewer supporting points than you might have guessed from cruder bounds and it gives you a formula you can actually evaluate given your dataset size and desired accuracy.

Complete Proposed Model Code in Python

The following Python listing sketches the core constructions in Dommel and Pichler (2025). It covers the minimal moment function via Hilbert matrix inversion, the multivariate weight function, kernel approximation via the integral operator image, the Nystrom low rank approximation based on the derived supporting point count, and a numerical demonstration on synthetic data illustrating the polynomial eigenfunction bound.

# =============================================================================
# Kernel Function Approximation via Minimal Moment Functions
# Paper: "On the Approximation of Kernel Functions"
# Authors: Paul Dommel and Alois Pichler, TU Chemnitz
# Journal: Journal of Machine Learning Research 26 (2025) 1-30
# =============================================================================

import numpy as np
from scipy.linalg import hilbert, solve
from scipy.spatial.distance import cdist
import warnings
warnings.filterwarnings('ignore')


# ─── SECTION 1: Hilbert Matrix and Minimal Moment Function ───────────────────

def build_hilbert_matrix(m: int) -> np.ndarray:
    """Construct the m x m Hilbert matrix H_m with H[i,j] = 1/(i+j+1).
    
    Theorem 1 in the paper establishes that the minimal moment function
    coefficients satisfy H_m * alpha = x_bar where x_bar = (1, x, ..., x^{m-1}).
    """
    return hilbert(m)  # scipy.linalg.hilbert builds H[i, j] = 1 / (i + j + 1)


def minimal_moment_coefficients(x: float, m: int) -> np.ndarray:
    """Solve H_m * alpha = x_bar to get minimal moment function coefficients.
    
    The minimal moment function w_m^x(z) = sum_{i=1}^m alpha_i z^{i-1}
    is the unique L2-norm minimizer satisfying the moment constraints
    int_0^1 z^l w(z) dz = x^l for l = 0, ..., m-1 (Theorem 1).
    
    Parameters
    ----------
    x : point in [0, 1] at which to center the moment function
    m : number of moment constraints (polynomial degree m-1)
    
    Returns
    -------
    alpha : (m,) coefficient vector for the polynomial representation
    """
    H = build_hilbert_matrix(m)
    x_bar = np.array([x**l for l in range(m)])
    alpha = solve(H, x_bar)
    return alpha


def minimal_moment_function(z: np.ndarray, x: float, m: int) -> np.ndarray:
    """Evaluate the minimal moment function w_m^x at points z in [0,1].
    
    This implements the explicit solution from Theorem 1:
      w_m^x(z) = sum_{i=1}^m alpha_{x,i} * z^{i-1}
    where alpha_x solves H_m * alpha_x = x_bar.
    """
    alpha = minimal_moment_coefficients(x, m)
    z = np.asarray(z)
    result = np.zeros_like(z, dtype=float)
    for i, a in enumerate(alpha):
        result += a * (z ** i)
    return result


def moment_function_norm_squared(x: float, m: int, n_quad: int = 64) -> float:
    """Numerically verify the norm bound ||w_m^x||^2 <= m^2 (Theorem 2).
    
    Uses Gauss-Legendre quadrature on [0,1] to compute the squared L2 norm.
    The rule is exact here once n_quad >= m, because w^2 is a polynomial of
    degree 2m - 2. The paper proves the norm is at most m^2, with equality at
    x=0 and x=1.
    """
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    z = 0.5 * (nodes + 1.0)              # map nodes from [-1, 1] to [0, 1]
    w = minimal_moment_function(z, x, m)
    return float(0.5 * np.sum(weights * w**2))


# ─── SECTION 2: Gaussian Kernel and Integral Operator Approximation ───────────

class GaussianKernel:
    """Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2) with width sigma.
    
    Implements the Hilbert-Schmidt integral operator L_k and its approximation
    via the minimal moment function construction from Section 3.
    """
    def __init__(self, sigma: float = 1.0):
        self.sigma = sigma

    def __call__(self, X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Evaluate the kernel matrix k(X, Y). X shape (n, d), Y shape (m, d)."""
        dists_sq = cdist(X, Y, metric='sqeuclidean')
        return np.exp(-self.sigma * dists_sq)

    def taylor_coefficient(self, ell: int) -> float:
        """Taylor coefficient a_ell / ell! of phi(t) = exp(-sigma * t).
        
        phi(t) = sum_{ell=0}^infty (-sigma)^ell / ell! * t^ell
        so a_ell / ell! = (-sigma)^ell / ell!
        The paper uses |a_ell| / ell! = sigma^ell / ell! for the error bound.
        """
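        # np.math is not available in NumPy 2.x; build ell! as a float product instead
        return ((-self.sigma) ** ell) / float(np.prod(np.arange(1, ell + 1, dtype=float)))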

    def taylor_tail_bound(self, m: int, d: int) -> float:
        """Bound the Taylor remainder sum_{ell >= floor((m-1)/2)+1} sigma^ell * d^ell / ell!.
        
        This implements the tail bound from Theorem 3 and feeds into Theorem 4.
        For the Gaussian kernel with sigma and dimension d, this decays
        superexponentially as m increases.
        """
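        start = int(np.floor((m - 1) / 2)) + 1
        x = self.sigma * d
        # Build (sigma*d)^ell / ell! recursively (np.math is gone in NumPy 2.x)
        term = 1.0
        for ell in range(1, start + 1):
            term *= x / ell
        tail = 0.0
        for ell in range(start, start + 200):
            tail += term
            if term < 1e-15:
                break
            term *= x / (ell + 1)
        return tail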


def approximate_kernel_at_x(
    kernel: GaussianKernel,
    x: float,
    y_grid: np.ndarray,
    z_grid: np.ndarray,
    m: int,
) -> np.ndarray:
    """Approximate k_x(y) = k(x, y) via L_k W_m^x evaluated at y_grid.
    
    Implements the approximation from Section 3.1:
      (L_k W_m^x)(y) = int_0^1 k(z, y) w_m^x(z) dz
    
    This is the core approximation from Theorems 3 and 4. The error decays
    superexponentially in m for the Gaussian kernel.
    
    Parameters
    ----------
    kernel    : GaussianKernel instance
    x         : the point at which we approximate k_x
    y_grid    : (n_y,) array of evaluation points
    z_grid    : (n_z,) quadrature points for the integral
    m         : moment order (higher m = better approximation)
    
    Returns
    -------
    approx : (n_y,) approximation of k(x, y) for each y in y_grid
    """
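    w_vals = minimal_moment_function(z_grid, x, m)         # (n_z,)
    # Trapezoid weights on z_grid, built by hand so the code does not rely on
    # np.trapz (deprecated in NumPy 2.0 in favor of np.trapezoid)
    dz = np.diff(z_grid)
    weights = np.zeros(len(z_grid))
    weights[:-1] += 0.5 * dz
    weights[1:] += 0.5 * dz
    # k(z, y) for every pair at once: shape (n_y, n_z)
    K = kernel(np.reshape(y_grid, (-1, 1)), np.reshape(z_grid, (-1, 1)))
    return K @ (weights * w_vals)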


# ─── SECTION 3: Eigenfunction Bound Verification ─────────────────────────────

def estimate_eigenfunctions(
    kernel: GaussianKernel,
    n_points: int = 200,
    n_eigenfunctions: int = 30,
) -> tuple:
    """Estimate eigenfunctions of the Gaussian kernel on [0,1] via discretization.
    
    Implements a quadrature approximation of the Hilbert-Schmidt operator L_k
    and extracts its eigensystem. Used to numerically verify the polynomial
    bound max|phi_ell(x)| <= b * ell^2 from Theorem 7 / equation (3.14).
    
    Parameters
    ----------
    kernel           : GaussianKernel instance
    n_points         : number of quadrature points for discretization
    n_eigenfunctions : number of eigenfunctions to return
    
    Returns
    -------
    eigenvalues  : (n_eigenfunctions,) in descending order
    eigenfunctions : (n_points, n_eigenfunctions) evaluated on the grid
    grid         : (n_points,) quadrature grid
    """
    grid = np.linspace(0, 1, n_points)
    dx = grid[1] - grid[0]
    K = kernel(grid.reshape(-1, 1), grid.reshape(-1, 1))    # (n, n)
    # Discretize L_k as dx * K (rectangle rule on the uniform grid).
    # Eigenpairs whose eigenvalue is near machine epsilon are dominated by
    # round-off, so only the leading ones are reliable.
    L_discrete = dx * K
    eigenvalues, eigenvectors = np.linalg.eigh(L_discrete)
    # Sort descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Normalize with the same rule: dx * sum(phi^2) = 1
    norms = np.sqrt(dx * np.sum(eigenvectors**2, axis=0))
    eigenvectors = eigenvectors / norms[np.newaxis, :]
    return (
        eigenvalues[:n_eigenfunctions],
        eigenvectors[:, :n_eigenfunctions],
        grid
    )


def verify_polynomial_eigenfunction_bound(
    kernel: GaussianKernel,
    n_points: int = 300,
    n_eigenfunctions: int = 25,
) -> dict:
    """Numerically verify the quadratic eigenfunction bound from Theorem 7.
    
    Computes max|phi_ell| for each eigenfunction index ell and fits the
    best quadratic b * ell^2 to the resulting curve. Theorem 7 guarantees
    existence of a constant b such that this bound holds for all ell.
    
    Returns dict with eigenvalues, max values, and fitted constant b.
    """
    evals, evecs, grid = estimate_eigenfunctions(kernel, n_points, n_eigenfunctions)
    max_values = np.max(np.abs(evecs), axis=0)   # max_x |phi_ell(x)|
    indices = np.arange(1, n_eigenfunctions + 1)
    # Fit b such that b * ell^2 >= max|phi_ell| for all ell
    b_fitted = np.max(max_values / indices**2)
    return {
        'eigenvalues': evals,
        'max_phi_values': max_values,
        'indices': indices,
        'b_constant': b_fitted,
        'polynomial_bound': b_fitted * indices**2,
    }


# ─── SECTION 4: Nystrom Low-Rank Approximation ────────────────────────────────

class NystromApproximation:
    """Nystrom low-rank kernel matrix approximation (Williams and Seeger 2000).
    
    Implements the approximation whose required support point count is made
    explicit by Theorem 9 in Dommel and Pichler (2025). The key insight is
    that N_infty(lambda) = O(log(lambda^-1)^{2d}) for the Gaussian kernel,
    which means far fewer support points are needed than previously known.
    
    Parameters
    ----------
    kernel     : GaussianKernel instance
    lambda_reg : regularization parameter (can now be near machine precision)

    Notes
    -----
    The support point count is an argument of fit() (m_support), not of the
    constructor; compute_support_count gives guidance.
    """
    def __init__(self, kernel: GaussianKernel, lambda_reg: float = 1e-6):
        self.kernel = kernel
        self.lambda_reg = lambda_reg
        self.support_points = None
        self.alpha = None

    def compute_support_count(self, d: int, sigma: float, cp: float = 1.0) -> int:
        """Compute the minimum support point count from Theorem 9.
        
        Implements equation (4.5):
          m >= (67 v 5 * sqrt(9*cp*(3*c_sigma*s(lambda)+2)^{2d} + 1)) * ln(lambda^-1)
        
        This is the first explicit formula for how many Nystrom support points
        suffice for near-optimal kernel regression, derived from the eigenfunction
        bounds established in this paper.
        
        Parameters
        ----------
        d     : input dimension
        sigma : Gaussian kernel width parameter
        cp    : sup_z p(z)^{-1}, inverse density upper bound (default 1 for uniform)
        
        Returns
        -------
        m_min : minimum number of support points needed
        """
        c_sigma = max(1, 2 * np.e * sigma * d)
        lam = self.lambda_reg
        # s(lambda) = max(-0.5 * log(lambda/9), e, d*c); the d*c term is omitted here
        s_lam = max(-0.5 * np.log(lam / 9.0), np.e)
        inner = np.sqrt(9 * cp * (3 * c_sigma * s_lam + 2) ** (2 * d) + 1)
        log_term = np.log(1.0 / lam)
        m_min = int(np.ceil(max(67, 5 * inner) * log_term))
        return m_min

    def fit(self, X_train: np.ndarray, y_train: np.ndarray,
            m_support: int = None) -> None:
        """Fit the Nystrom approximation to training data.
        
        Randomly selects m_support points from X_train as the landmark set,
        builds the low-rank kernel approximation, and solves the regularized
        least squares problem.
        
        Parameters
        ----------
        X_train   : (n, d) training inputs
        y_train   : (n,) training targets
        m_support : number of support points (computed automatically if None)
        """
        n, d = X_train.shape
        if m_support is None:
            m_support = min(self.compute_support_count(d, self.kernel.sigma), n)
        m_support = min(m_support, n)
        # Random subset as support points
        idx = np.random.choice(n, m_support, replace=False)
        self.support_points = X_train[idx]
        # Build Nystrom feature map via eigendecomposition of K_{mm}
        K_mm = self.kernel(self.support_points, self.support_points)
        K_nm = self.kernel(X_train, self.support_points)
        eigvals, eigvecs = np.linalg.eigh(K_mm)
        eigvals = np.maximum(eigvals, 1e-12)
        # Nystrom feature map Phi: (n, m_support)
        Phi = K_nm @ eigvecs @ np.diag(1.0 / np.sqrt(eigvals))
        # Regularized least squares: alpha = (Phi^T Phi + lambda * I)^{-1} Phi^T y
        A = Phi.T @ Phi + self.lambda_reg * np.eye(m_support)
        b = Phi.T @ y_train
        self._eigvals = eigvals
        self._eigvecs = eigvecs
        self.alpha = np.linalg.solve(A, b)
        self._Phi_train = Phi

    def predict(self, X_test: np.ndarray) -> np.ndarray:
        """Predict at new test points using the Nystrom approximation."""
        K_test = self.kernel(X_test, self.support_points)
        Phi_test = K_test @ self._eigvecs @ np.diag(1.0 / np.sqrt(self._eigvals))
        return Phi_test @ self.alpha


# ─── SECTION 5: Concentration Inequality Check ────────────────────────────────

def check_regularization_condition(
    n: int,
    lambda_reg: float,
    d: int,
    sigma: float,
    tau: float = 2.0,
    cp: float = 1.0,
) -> dict:
    """Check whether condition (4.2) of Proposition 3 is satisfied.
    
    The condition ensures the concentration inequality holds with probability
    at least 1 - 2 * exp(-tau). The key result is that lambda can decay as fast
    as exp(-n^{1/(2d+1)}) and the condition still holds for large enough n.
    This is a dramatic improvement over the O(1/n) constraint in prior work.
    
    Parameters
    ----------
    n          : sample size
    lambda_reg : proposed regularization parameter
    d          : input dimension
    sigma      : Gaussian kernel bandwidth
    tau        : confidence parameter (higher = more confident)
    cp         : sup density inverse
    
    Returns
    -------
    dict with N_infty bound, g(lambda), condition value, and whether it holds
    """
    c_sigma = max(1.0, 2 * np.e * sigma * d)
    s_lam = max(-0.5 * np.log(lambda_reg / 9.0), np.e)
    N_inf = 9 * cp * (3 * c_sigma * s_lam + 2) ** (2 * d) + 1
    # Effective spectral count N(lambda): sum mu_l / (lambda + mu_l)
    # Approximate as logarithmic in 1/lambda for the Gaussian kernel
    N_lambda_approx = np.log(1.0 / lambda_reg) ** d
    mu1_approx = np.exp(-sigma * 0)                       # k(x,x)=1, mu1 ~ O(1)
    g_lam = np.log(2 * np.e * (1 + mu1_approx) / mu1_approx * N_lambda_approx)
    condition_lhs = (4.0 / 3.0) * tau * g_lam * N_inf / n + np.sqrt(
        2 * tau * g_lam * N_inf / n
    )
    return {
        'N_infty_bound': N_inf,
        'g_lambda': g_lam,
        'condition_lhs': condition_lhs,
        'condition_satisfied': condition_lhs <= 0.5,
        'probability_bound': 1 - 2 * np.exp(-tau),
    }


# ─── SECTION 6: End-to-End Demonstration ──────────────────────────────────────

def run_full_demonstration():
    """End-to-end demonstration of all constructions in Dommel and Pichler (2025).
    
    Demonstrates:
      1. Minimal moment function construction and norm bound verification
      2. Kernel approximation quality as a function of m
      3. Polynomial eigenfunction bound verification
      4. Nystrom approximation with the derived support point count
      5. Concentration condition check for near-machine-precision lambda
    """
    print("=" * 70)
    print("Dommel and Pichler (JMLR 2025): Kernel Approximation Demonstration")
    print("=" * 70)

    sigma = 1.0
    d = 1
    kernel = GaussianKernel(sigma=sigma)

    # 1. Norm bound verification (Theorem 2)
    print(f"\n{'─'*55}")
    print("PART 1: Minimal Moment Function Norm Bound (Theorem 2)")
    print("Claim: ||w_m^x||^2 <= m^2 for all x in [0,1]")
    print(f"{'─'*55}")
    m = 5
    test_points = np.linspace(0, 1, 11)
    for x in test_points:
        norm_sq = moment_function_norm_squared(x, m)
        bound = m**2
        status = "OK" if norm_sq <= bound + 1e-6 else "VIOLATION"
        print(f"  x={x:.2f}  norm^2={norm_sq:.4f}  bound=m^2={bound}  [{status}]")

    # 2. Kernel approximation error vs m
    print(f"\n{'─'*55}")
    print("PART 2: Kernel Approximation Error Decays with m (Theorem 4)")
    print("Approximating k(x=0.3, y) for y in [0,1]")
    print(f"{'─'*55}")
    x0 = 0.3
    y_grid = np.linspace(0, 1, 50)
    z_grid = np.linspace(0, 1, 200)
    true_vals = kernel(y_grid.reshape(-1,1), np.array([[x0]])).flatten()
    for m_val in [2, 4, 6, 8]:
        approx_vals = approximate_kernel_at_x(kernel, x0, y_grid, z_grid, m_val)
        err = np.max(np.abs(approx_vals - true_vals))
        print(f"  m={m_val}  max infinity-norm error = {err:.6f}")

    # 3. Polynomial eigenfunction bound (Theorem 7)
    print(f"\n{'─'*55}")
    print("PART 3: Polynomial Eigenfunction Bound (Theorem 7 / Eq. 3.14)")
    print("Numerically verifying max|phi_ell| <= b * ell^2")
    print(f"{'─'*55}")
    result = verify_polynomial_eigenfunction_bound(kernel, n_points=300, n_eigenfunctions=20)
    print(f"  Fitted constant b = {result['b_constant']:.4f}")
    print("  ell   max|phi_ell|   b*ell^2   ratio")
    for i in range(0, 10):
        ell = result['indices'][i]
        mv = result['max_phi_values'][i]
        bound_val = result['b_constant'] * ell**2
        ratio = mv / bound_val
        print(f"   {ell:3d}   {mv:.4f}         {bound_val:.4f}    {ratio:.3f}")

    # 4. Nystrom support point count from Theorem 9
    print(f"\n{'─'*55}")
    print("PART 4: Nystrom Support Points from Theorem 9 (Eq. 4.5)")
    print(f"{'─'*55}")
    nystrom = NystromApproximation(kernel, lambda_reg=1e-6)
    m_needed = nystrom.compute_support_count(d=1, sigma=sigma)
    print(f"  Minimum support points for d=1, sigma={sigma}, lambda=1e-6: {m_needed}")
    np.random.seed(42)
    X_train = np.random.uniform(0, 1, (500, 1))
    y_train = np.sin(2 * np.pi * X_train.flatten()) + 0.1 * np.random.randn(500)
    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    y_test_true = np.sin(2 * np.pi * X_test.flatten())
    nystrom.fit(X_train, y_train, m_support=min(m_needed, 100))
    y_pred = nystrom.predict(X_test)
    mse = np.mean((y_pred - y_test_true)**2)
    print(f"  Nystrom regression MSE on sin(2*pi*x) target: {mse:.6f}")

    # 5. Concentration condition with near-zero lambda
    print(f"\n{'─'*55}")
    print("PART 5: Concentration Condition (Proposition 3)")
    print("Testing whether lambda near machine precision still satisfies (4.2)")
    print(f"{'─'*55}")
    for lam_test in [1e-2, 1e-4, 1e-6, 1e-8]:
        cc = check_regularization_condition(n=1000, lambda_reg=lam_test, d=1,
                                            sigma=sigma, tau=2.0)
        print(f"  lambda={lam_test:.0e}  N_inf={cc['N_infty_bound']:.1f}  "
              f"condition={cc['condition_lhs']:.4f}  satisfied={cc['condition_satisfied']}")

    print(f"\n{'=' * 70}")
    print("Demonstration complete. Compare the printed values with the bounds in the paper.")
    print("See Dommel and Pichler, JMLR 26 (2025), for the full proofs.")
    print("=" * 70)


if __name__ == '__main__':
    run_full_demonstration()

Why the Quadratic Eigenfunction Bound Is a Bigger Deal Than It Looks

To appreciate why establishing a polynomial bound on eigenfunction magnitudes matters, you need to think about what happens when you sum over many eigenfunctions. Regression estimators in RKHS are expressed as sums over the eigenfunction basis. The convergence theory depends on how quickly those sums can be controlled. If the eigenfunctions can grow exponentially in their index then controlling the uniform norm of such sums requires exponentially strong conditions on the coefficients. In practice this means you need the target function to be extraordinarily smooth before you can say anything about uniform error.

With a polynomial bound the situation is completely different. A quadratic growth rate means the Riemann zeta series at the exponent 2 times s converges for s greater than one half. The interpolation inequality in Theorem 8 requires s greater than 3 which is a mild condition. Functions that satisfy it are simply functions whose eigenfunction expansion has coefficients decaying faster than l to the power negative 3. That includes all functions in the native space of the Gaussian kernel and many functions outside it.
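One way to see where a threshold of s greater than 3 can come from is a short numerical check. The sketch below assumes, purely for illustration, that the expansion coefficients satisfy |a_l| <= l^(-s) and that the eigenfunctions satisfy max|phi_l| <= b * l^2. Under those assumptions the uniform norm of the expansion is at most b times the sum of l^(2-s), which converges exactly when s is greater than 3. The function name and the choice b = 1 are hypothetical and are not taken from the paper.

```python
import numpy as np


def uniform_bound_partial_sum(s, n_terms, b=1.0):
    """Partial sum of b * l^2 * l^(-s) for l = 1..n_terms.

    Illustrative assumptions (not the paper's exact hypotheses):
    |a_l| <= l^(-s) and max|phi_l| <= b * l^2.
    """
    ell = np.arange(1, n_terms + 1, dtype=float)
    return float(b * np.sum(ell ** (2.0 - s)))


if __name__ == "__main__":
    # Compare partial sums for exponents below, at, and above the threshold.
    for s in (2.5, 3.0, 4.0):
        sums = [uniform_bound_partial_sum(s, n) for n in (10**2, 10**4, 10**6)]
        print(f"s={s}: " + "  ".join(f"{v:.4f}" for v in sums))
```

For s = 4 the partial sums settle near pi squared over 6, while for s = 3 and below they keep growing as more terms are added. With an exponential eigenfunction bound no polynomial coefficient decay would make the corresponding sum converge.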

The practical consequence is that uniform convergence guarantees become available for a much wider class of regression problems than previously covered by theory. If you have been using kernel ridge regression with a Gaussian kernel and worrying whether your uniform error bound is valid, this paper gives you substantially better theoretical backing than existed before.

“This is the first non-exponential bound on the absolute maximum of Gaussian kernel eigenfunctions on a bounded domain, and it enables interpolation inequalities connecting L2 convergence to uniform convergence under mild smoothness.” Dommel and Pichler, JMLR 26 (2025)

What the Results Mean for Practical Regularization Tuning

There is a persistent gap between what regularization theory says you should do and what practitioners actually do. Theory says keep lambda on the order of 1 over n. Cross validation says try exponentially small values and pick whatever works. The gap has been tolerated because the theory was clearly conservative and practitioners learned to ignore it.

Proposition 3 in this paper closes much of that gap. The proposition says the concentration inequality that underlies kernel ridge regression theory remains valid as long as the regularization parameter decays no faster than an exponential in n to the power 1 over 2d plus 1. For a one dimensional problem with 1000 data points, 1 over 2d plus 1 equals one third, so n to the one third is ten and the allowable decay rate is roughly exp of negative ten. That is around 0.00005, which is already much smaller than the 0.001 that the 1 over n rule gives.
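The arithmetic for the one dimensional case can be reproduced in a few lines. This is a minimal sketch that compares only the leading-order rates and ignores every constant in Proposition 3; the function names are hypothetical.

```python
import math


def allowable_lambda_rate(n, d):
    """Leading-order rate exp(-n^(1/(2d+1))), with all constants ignored."""
    return math.exp(-(n ** (1.0 / (2 * d + 1))))


def classical_rate(n):
    """The 1/n rate that the article attributes to prior theory."""
    return 1.0 / n


if __name__ == "__main__":
    d = 1  # one dimensional inputs, as in the worked example
    for n in (100, 1000, 10000):
        new = allowable_lambda_rate(n, d)
        old = classical_rate(n)
        print(f"n={n:6d}  exp(-n^(1/3))={new:.3e}  1/n={old:.3e}")
```

Because the constants are dropped, the printed values indicate orders of magnitude only and should not be read as the exact thresholds from the paper.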

In higher dimensions the allowable decay rate slows down because the exponent 1 over 2d plus 1 shrinks. In five dimensions with 10000 training points, n to the power one eleventh is only about 2.3, so ignoring constants the leading-order rate exp of negative 2.3 is roughly 0.1, which is no smaller than 1 over n at that sample size. The gain over the 1 over n rule is therefore clearest in low dimensions or at very large sample sizes. In those regimes cross validation is likely finding values of lambda that are within the theoretically safe range, even though previous theory could not confirm this. Now it can.

For Practitioners

If you use kernel ridge regression or the Nystrom method with a Gaussian kernel you can now trust that cross validation choices of lambda are theoretically supported even when those choices are far below 1 over n. You also have an explicit formula for how many support points the Nystrom method needs, which lets you set that hyperparameter from theory rather than trial and error.

What Remains Open and Where the Research Goes Next

The paper is careful to acknowledge its own limitations. The multivariate weight function used throughout Section 2.2 is not the true minimum norm function satisfying the multivariate moment constraints. It is built by taking products of one dimensional solutions divided by the density. The true minimum norm function in multiple dimensions might have a substantially smaller norm and the paper explicitly flags this as an open problem. If someone constructs the genuine multivariate minimum, the bounds throughout the paper would improve by a factor that could be significant in high dimensions.

Remark 4 in Section 3 notes that the entire approach extends beyond the Gaussian kernel to any radial kernel of the form phi of the squared distance between two points, provided that a lower bound on the eigenvalue decay is available. The Laplacian kernel and the Matern family are natural next targets. The paper by Diaconis Goel and Holmes on the Laplacian kernel eigensystem in compact domains provides a starting point but the moment function approach would need to be combined with whatever Taylor coefficient decay holds for those kernels.

The dimension dependence of all the bounds is an honest limitation. The norm bound on the multivariate weight function grows as m to the power 2d which means the Nystrom support point count grows as log lambda inverse to the power 2d plus 1. For very high dimensional problems this still represents more points than you can easily afford. The paper does not claim to have solved the curse of dimensionality but the bounds are tight enough to give useful guidance up to moderate dimensions of around five or ten.
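To get a feel for how quickly this scaling grows with dimension, the sketch below evaluates log(1/lambda) to the power 2d plus 1 for a fixed lambda. It captures only the growth rate quoted above: the constants from the paper are omitted, and the function name is hypothetical.

```python
import math


def support_count_scaling(lambda_reg, d):
    """Growth term log(1/lambda)^(2d+1), with all constants omitted."""
    if not 0.0 < lambda_reg < 1.0:
        raise ValueError("lambda_reg must lie strictly between 0 and 1")
    return math.log(1.0 / lambda_reg) ** (2 * d + 1)


if __name__ == "__main__":
    lam = 1e-6
    for d in range(1, 6):
        print(f"d={d}  log(1/lambda)^(2d+1) = {support_count_scaling(lam, d):.3e}")
```

Each extra dimension multiplies the term by log(1/lambda) squared, which is why the guidance becomes hard to use well before the dimension gets large.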

There is also a question about whether the product structure of the multivariate weight function is optimal for non-product kernels. The Gaussian kernel in multiple dimensions is a product kernel and so the product construction is exactly right. But kernels defined on graph structured inputs or kernels with learned geometries would require a fundamentally different moment function construction. Extending the moment function approach to those settings would open up a much broader class of applications.
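The product structure of the multivariate Gaussian kernel is easy to check numerically. The sketch below assumes the parametrization exp(-||x - y||^2 / sigma^2), which may differ from the paper's convention; the factorization holds for any such bandwidth scaling. Both function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Multivariate Gaussian kernel exp(-||x - y||^2 / sigma^2) (assumed form)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma**2))


def product_of_1d_kernels(x, y, sigma=1.0):
    """Product over coordinates of the one dimensional Gaussian kernels."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.prod(np.exp(-(diff**2) / sigma**2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = rng.uniform(0, 1, 5), rng.uniform(0, 1, 5)
    print(gaussian_kernel(x, y), product_of_1d_kernels(x, y))
```

The two values agree up to floating point error because the squared Euclidean distance is a sum over coordinates and the exponential turns that sum into a product.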

Despite these open questions the core contribution stands firmly. The minimal moment function construction, the explicit polynomial eigenfunction bound, the liberalized concentration inequality, and the explicit Nystrom support point formula together constitute a coherent and significant advance in the mathematical foundations of kernel methods. The paper gives the community new tools to work with and new problems to work on. That combination is exactly what good theoretical machine learning research is supposed to deliver.

Read the Full Paper

The complete paper with all proofs including the Gaussian approximation appendix, eigenvalue decay appendix, and concentration inequality appendix is available open access from JMLR under a CC-BY 4.0 license.


Academic Citation
Dommel, P. and Pichler, A. (2025). On the Approximation of Kernel Functions. Journal of Machine Learning Research, 26, 1–30. Available at http://jmlr.org/papers/v26/24-0270.html

This article is an independent editorial analysis of peer-reviewed research. The Python implementation is an educational reproduction intended to illustrate the theoretical constructions and may differ from any official code releases. For research use verify all numerical results against the original paper. License for the paper is CC-BY 4.0.