Welcome! In this final part of the blog series (see Part I and Part II), we derive bounds on the quantity \(\gamma_T\) for practical classes of kernels.

Submodularity and Greedy Maximization

Theorem 4 For any \(T \in \mathbb{N}\) and any \(T_\ast = 1, \dots, T:\)

\begin{equation} \gamma_T \leq \mathcal{O}(\sigma^{-2} [B(T_\ast) T + T_\ast \log(n_T T)]) \nonumber \end{equation}

where \(n_T = \sum_{t = 1}^{\vert D \vert} \hat{\lambda}_t\) and \(B(T_\ast) = \sum_{t = T_\ast + 1}^{\vert D \vert} \hat{\lambda}_t\) are the trace and the tail sum, respectively, of the eigenvalues \(\{\hat{\lambda}_t\}\) of the kernel matrix.


Proof:

The sketch of the proof is as follows:

  • First, note that the information gain \(I(\mathbf{y}_A; \mathbf{f}_A)\) is a submodular function of the set \(A\). Therefore, \(\gamma_T\) can be bounded via greedy maximization (see the code sketch after this list).
  • Subsequently, we utilize a discretization \(D_T \subset D\) of size \(n_T = \vert D_T \vert = T^\tau\) whose nearest-neighbor distance is \(o(1)\). We consider the kernel matrix \(\mathbf{K}_{D_T} \in \mathbb{R}^{n_T \times n_T}\) and bound \(\gamma_T\) by an expression involving the eigenvalues \(\{ \hat{\lambda}_t \}\) of this matrix.
  • Finally, we bound this empirical expression in terms of the eigenvalues of the kernel operator associated with \(k\).
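
To make the first bullet concrete, here is a minimal Python sketch of greedy maximization of the information gain \(F(A) = \frac{1}{2} \log \vert \mathbf{I} + \sigma^{-2} \mathbf{K}_A \vert\) over a finite candidate set. The RBF kernel and the one-dimensional grid are illustrative choices, not part of the original argument; since \(F\) is monotone submodular, the greedy value is within a factor \(1 - e^{-1}\) of the optimum.

```python
import numpy as np

def info_gain(K, idx, sigma2=1.0):
    # F(A) = 1/2 log|I + sigma^{-2} K_A| via a numerically stable slogdet.
    K_A = K[np.ix_(idx, idx)]
    return 0.5 * np.linalg.slogdet(np.eye(len(idx)) + K_A / sigma2)[1]

def greedy_info_gain(K, T, sigma2=1.0):
    # Greedily add the point with the largest marginal gain; submodularity
    # guarantees F(greedy) >= (1 - 1/e) * max_{|A| = T} F(A).
    chosen = []
    for _ in range(T):
        rest = [j for j in range(K.shape[0]) if j not in chosen]
        gains = [info_gain(K, chosen + [j], sigma2) for j in rest]
        chosen.append(rest[int(np.argmax(gains))])
    return chosen, info_gain(K, chosen, sigma2)

# Illustrative domain: a uniform grid on [0, 1] with an RBF kernel.
X = np.linspace(0.0, 1.0, 50)[:, None]
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.1 ** 2)
print(greedy_info_gain(K, T=5))
```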

Greedy Maximization and Discretization

Assumptions. Fix \(T \in \mathbb{N}\) and assume there exists a discretization \(D_T \subset D\) of size \(n_T = \vert D_T \vert\) on the order of \(T^\tau\) s.t.

\[\begin{eqnarray} \forall x \in D \; \exists [x]_T \in D_T : \Vert x - [x]_T \Vert = \mathcal{O}(T^{- \tau / d}) \label{eq:discretization} \end{eqnarray}\]
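
For intuition, a uniform grid realizes such a discretization: \(m\) points per axis give \(n_T = m^d\) points with nearest-neighbor distance \(\mathcal{O}(m^{-1}) = \mathcal{O}(n_T^{-1/d})\), so choosing \(n_T\) on the order of \(T^\tau\) yields the \(\mathcal{O}(T^{-\tau/d})\) rate. A minimal sketch, assuming \(D = [0, 1]^d\):

```python
import numpy as np

def make_grid(d, m):
    # Uniform grid on [0, 1]^d with m points per axis (n_T = m**d points).
    axes = (np.linspace(0.0, 1.0, m),) * d
    return np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)

def round_to_grid(x, grid):
    # [x]_T: the nearest grid point; ||x - [x]_T|| <= sqrt(d) / (2 (m - 1)).
    return grid[np.argmin(np.linalg.norm(grid - x, axis=1))]
```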

We restrict the information gain to subsets \(A \subset D_T\):

\[\begin{eqnarray} \tilde{\gamma}_T = \max_{A \subset D_T, \vert A \vert = T} I(\mathbf{y}_A; \mathbf{f}_A) \nonumber \end{eqnarray}\]

Lemma 7.4 Under the assumptions of Theorem 2, the information gain \(F_T(\{ x_t \}) = \frac{1}{2} \log \vert \mathbf{I} + \sigma^{-2} \mathbf{K}_{\{x_t\}} \vert\) is uniformly Lipschitz-continuous in each component \(x_t \in D\).

Proof:

Theorem 2 implies that the kernel \(k\) is continuously differentiable. The result follows from the fact that \(F_T(\{ x_t \})\) is continuously differentiable as a function of the kernel matrix \(\mathbf{K}_{\{x_t\}}\), whose entries are in turn continuously differentiable in each component \(x_t\).


Lemma 7.5 Let \(D_T\) be a discretization of \(D\) such that \eqref{eq:discretization} holds. Under the assumptions of Theorem 2, we have that

\begin{equation} 0 \leq \gamma_T - \tilde{\gamma}_T = \mathcal{O}(T^{1 - \tau / d}) \nonumber \end{equation}

Proof:

Fix \(T \in \mathbb{N}\). Let \(A = \{x_1, \dots, x_T\}\) be a maximizer attaining \(\gamma_T\), and let \([A]_T = \{[x_t]_T\}_{t = 1}^T\). Then,

\[\begin{eqnarray} 0 \leq \gamma_T - \tilde{\gamma}_T \leq \gamma_T - I(\mathbf{y}_{[A]_T}; \mathbf{f}_{[A]_T}) = F_T(A) - F_T([A]_T) \nonumber \end{eqnarray}\]

By Lemma 7.4, \(F_T\) is uniformly Lipschitz-continuous in each component, so that \(\vert \gamma_T - I(\mathbf{y}_{[A]_T}; \mathbf{f}_{[A]_T}) \vert = \mathcal{O}(T \max_{t} \Vert x_t - [x_t]_T \Vert) = \mathcal{O}(T^{1 - \tau / d})\) by \eqref{eq:discretization} and the mean value theorem.


With the following lemma, we obtain an upper bound on \(\tilde{\gamma}_T\).

Lemma 7.6 For any \(T \geq 1\), we have that

\[\begin{eqnarray} \tilde{\gamma}_T \leq \frac{1 / 2}{1 - e^{-1}} \max_{m_1, \dots, m_{n_T}} \sum_{t = 1}^{n_T} \log(1 + \sigma^{-2} m_t \hat{\lambda}_t) \nonumber \end{eqnarray}\]

where the maximum is over nonnegative integers \(m_1, \dots, m_{n_T}\) with \(\sum_t m_t = T\), and \(\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \dots\) are the eigenvalues of \(\mathbf{K}_{D_T}\).
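
As a sanity check (an illustration, not part of the original proof), the right-hand side can be evaluated by brute force on a tiny kernel matrix; the enumeration over allocations \((m_t)\) is exponential, so this is only feasible for very small \(T\) and \(n_T\).

```python
import itertools
import numpy as np

def lemma_7_6_bound(K, T, sigma2=1.0):
    # RHS of Lemma 7.6: the best allocation of T "samples" across the
    # eigenvalues of K, scaled by (1/2) / (1 - 1/e).
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]
    best = -np.inf
    for m in itertools.product(range(T + 1), repeat=len(lam)):
        if sum(m) == T:  # nonnegative integers summing to T
            best = max(best, np.sum(np.log1p(np.array(m) * lam / sigma2)))
    return 0.5 / (1 - np.exp(-1)) * best
```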

Empirical to Process Eigenvalues

  • Let \(\mu(x) = \mathcal{V}(D)^{-1} I_{x \in D}\) be the density of the uniform distribution on \(D\), where \(\mathcal{V}(D) = \int_{x \in D} dx\), and assume that \(k\) is continuous.
  • Assuming \(k(x, x) = 1\), we have \(\int k(x, x) \mu(x) \, dx = 1\), so that \(k\) is Hilbert–Schmidt on \(L_2(\mu)\).
  • Mercer’s theorem: the corresponding kernel operator has a discrete eigenspectrum \(\{(\lambda_s, \phi_s(\cdot))\}\) and
\[\begin{eqnarray} k(x, x^\prime) = \sum_{s \geq 1} \lambda_s \phi_s(x) \phi_s(x^\prime) \nonumber \end{eqnarray}\]

where \(\lambda_1 \geq \lambda_2 \geq \dots \geq 0, \mathbb{E}_\mu[\phi_s(x) \phi_t(x)] = \delta_{s, t}\).
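
The scaled matrix eigenvalues \(\hat{\lambda}_t / n_T\) computed on an i.i.d. sample from \(\mu\) approximate the operator eigenvalues \(\lambda_t\) (the Nyström approximation), which is the bridge exploited below. A minimal numerical sketch, assuming \(D = [0, 1]\) and a squared exponential kernel with an illustrative lengthscale:

```python
import numpy as np

# Empirical spectrum: the eigenvalues of K_{D_T} / n_T on a uniform sample
# approximate the operator eigenvalues of k w.r.t. mu (Nystrom method).
rng = np.random.default_rng(0)
n_T, ell = 500, 0.2
D_T = rng.uniform(0.0, 1.0, size=(n_T, 1))
K = np.exp(-((D_T - D_T.T) ** 2) / (2 * ell ** 2))
lam_hat = np.sort(np.linalg.eigvalsh(K))[::-1] / n_T
print(lam_hat[:5])  # rapid decay; the full sum (the trace) is exactly 1
```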


The following lemma determines the sizes \(n_T\) for which the discretizations above exist.

Lemma 7.7 Fix \(T \in \mathbb{N}, \delta > 0\) and \(\epsilon > 0\). There exists a discretization \(D_T \subset D\) of size

\begin{equation} n_T = \mathcal{V}(D) (\epsilon / \sqrt{d})^{- d}[\log (1 / \delta) + d \log(\sqrt{d} / \epsilon) + \log \mathcal{V}(D)] \nonumber \end{equation}

which fulfils the following requirements:

  • \(\epsilon\)-denseness: For any \(x \in D\), there exists \([x]_T \in D_T\) such that \(\Vert x - [x]_T \Vert \leq \epsilon\).

  • If \(\mathrm{spec}(\mathbf{K}_{D_T}) = \{\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \dots \}\), then for any \(T_\ast = 1, \dots, n_T\):

\[\begin{eqnarray} n_T^{-1} \sum_{t = 1}^{T_\ast} \hat{\lambda}_t \geq \sum_{t = 1}^{T_{\ast}} \lambda_t - \delta \nonumber \end{eqnarray}\]
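
The \(\log(1/\delta)\) term hints at a randomized construction: sampling \(n_T\) points i.i.d. from \(\mu\) satisfies both requirements with probability at least \(1 - \delta\). A Monte-Carlo sanity check of the \(\epsilon\)-denseness requirement (an illustration only, assuming \(D = [0, 1]^d\)):

```python
import numpy as np

def worst_covering_distance(n_T, d=2, probes=5000, seed=0):
    # Empirical eps-denseness: the largest distance from a random probe
    # point in [0, 1]^d to its nearest neighbor in the candidate set D_T.
    rng = np.random.default_rng(seed)
    D_T = rng.uniform(size=(n_T, d))
    x = rng.uniform(size=(probes, d))
    dists = np.linalg.norm(x[:, None, :] - D_T[None, :, :], axis=-1)
    return dists.min(axis=1).max()
```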

The following lemma, a direct consequence of Lemma 7.6, is the counterpart of Theorem 4 on the discretized domain.

Lemma 7.8 Let \(D_T\) be some discretization of \(D, n_T = \vert D_T \vert\). For any \(T_\ast = 1, \dots, \min\{T, n_T\}\) :

\[\begin{eqnarray} \tilde{\gamma}_T \leq \frac{1 / 2}{1 - e^{-1}} \max_{r = 1, \dots, T} \left( T_\ast \log (r n_T / \sigma^2) + (T - r) \sigma^{-2} \sum_{t = T_\ast + 1}^{n_T} \hat{\lambda}_t \right) \nonumber \end{eqnarray}\]

Proof:

Split the sum on the right-hand side of Lemma 7.6 at \(t = T_\ast\), and let \(r = \sum_{t \leq T_\ast} m_t\).

  • For \(t \leq T_\ast\): \(\log (1 + m_t \hat{\lambda}_t / \sigma^2) \leq \log (r n_T / \sigma^2)\), since \(m_t \leq r\) and \(\hat{\lambda}_t \leq \mathrm{tr}(\mathbf{K}_{D_T}) = n_T\).
  • For \(t > T_\ast\): \(\log (1 + m_t \hat{\lambda}_t / \sigma^2) \leq m_t \hat{\lambda}_t / \sigma^2 \leq (T - r) \hat{\lambda}_t / \sigma^2\), using \(\log(1 + x) \leq x\) and \(m_t \leq \sum_{s > T_\ast} m_s = T - r\).
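
Summing the two cases and maximizing over the remaining free quantity \(r = 1, \dots, T\) yields the claim:

\[\begin{eqnarray} \sum_{t = 1}^{n_T} \log(1 + \sigma^{-2} m_t \hat{\lambda}_t) \leq T_\ast \log (r n_T / \sigma^2) + (T - r) \sigma^{-2} \sum_{t = T_\ast + 1}^{n_T} \hat{\lambda}_t \nonumber \end{eqnarray}\]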

The following theorem yields bounds on \(\gamma_T\) for a particular kernel \(k\), given that tail bounds on \(B_k(T_\ast) = \sum_{s > T_\ast} \lambda_s\) are known.

Theorem 8 Suppose that \(D \subset \mathbb{R}^d\) is compact, and \(k(x, x^\prime)\) is a covariance function for which the additional assumption of Theorem 2 holds. Moreover, let \(B_k(T_\ast) = \sum_{s > T_\ast} \lambda_s\), where \(\{\lambda_s\}\) is the operator spectrum of \(k\) with respect to the uniform distribution over \(D\). Pick \(\tau > 0\), and let \(n_T = C_4 T^\tau (\log T)\) with \(C_4 = 2 \mathcal{V}(D)(2 \tau + 1)\). Then, the following bound holds:

\[\begin{eqnarray} \gamma_T \leq \frac{1 / 2}{1 - e^{-1}} \max_{r = 1, \dots, T} \left(T_\ast \log (r n_T / \sigma^2) + C_4 \sigma^{-2} (1 - r / T) (\log T) (T^{\tau + 1} B_k(T_\ast) + 1) \right) + \mathcal{O}(T^{1 - \tau / d}) \nonumber \end{eqnarray}\]

for any \(T_\ast \in \{1, \dots, n_T \}\).

Proof:

Let \(\epsilon = d^{1 / 2} T^{- \tau / d}\) and \(\delta = T^{- (\tau + 1)}\). Lemma 7.7 provides the existence of a discretization \(D_T\) of size \(n_T\) which is \(\epsilon\)-dense and for which \(n_T^{-1} \sum_{t = 1}^{T_\ast} \hat{\lambda}_t \geq \sum_{t = 1}^{T_{\ast}} \lambda_t - \delta\). Since \(n_T^{-1} \sum_{t = 1}^{n_T} \hat{\lambda}_t = 1 = \sum_{s \geq 1} \lambda_s\), subtracting gives the tail comparison \(n_T^{-1} \sum_{t = T_\ast + 1}^{n_T} \hat{\lambda}_t \leq B_k(T_\ast) + \delta\). The statement follows by using Lemma 7.8 with this bound, and finally employing Lemma 7.5.

Bounds for Kernels

Next, we bound \(\gamma_T\) via Theorem 8 for a range of commonly used kernel functions: the linear kernel, the squared exponential kernel, and Matérn kernels. The results imply sublinear regret bounds for GP-UCB in all cases.

Linear Kernel. A GP with this kernel corresponds to a random linear function \(f(x) = w^\top x, \quad w \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\)

\[\begin{eqnarray} k(x, x^\prime) = x^\top x^\prime \nonumber \end{eqnarray}\]
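
Since the linear kernel is a sum of \(d\) coordinate products, its kernel operator has rank at most \(d\); hence the tail sum vanishes beyond \(T_\ast = d\), which is what drives the \(\mathcal{O}(d \log T)\) bound in Theorem 5 below:

\[\begin{eqnarray} k(x, x^\prime) = \sum_{i = 1}^d x_i x_i^\prime \quad \Longrightarrow \quad \lambda_s = 0 \text{ for } s > d, \quad B_k(T_\ast) = 0 \text{ for } T_\ast \geq d \nonumber \end{eqnarray}\]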

Squared exponential kernel. Sample functions are differentiable to any order almost surely.

\[\begin{eqnarray} k(x, x^\prime) = \exp(- \Vert x - x^\prime \Vert^2 / 2 \ell^2 ) \nonumber \end{eqnarray}\]

where \(\ell\) is the lengthscale parameter.

Matérn kernel.

\[\begin{eqnarray} k(x, x^\prime) = (2^{1 - \nu} / \Gamma(\nu)) r^\nu B_\nu(r), \quad r = (\sqrt{2 \nu} / \ell) \Vert x - x^\prime \Vert \nonumber \end{eqnarray}\]

where \(B_\nu\) is the modified Bessel function and \(\nu\) controls the smoothness of the sample paths.


Theorem 5 Let \(D \subset \mathbb{R}^d\) be compact and convex, \(d \in \mathbb{N}\). Assume the kernel satisfies \(k(x, x^\prime) \leq 1\).

  1. Finite spectrum. For the \(d\)-dimensional Bayesian linear regression case: \(\gamma_T = \mathcal{O}(d \log T)\).
  2. Exponential spectral decay. For the squared exponential kernel: \(\gamma_T = \mathcal{O}((\log T)^{d + 1})\).
  3. Power law spectral decay. For Matérn kernels with \(\nu > 1\): \(\gamma_T = \mathcal{O}(T^{d(d + 1) / (2 \nu + d (d + 1))} \log T)\).

Sketch of proof:

\(\gamma_T\) is bounded by Theorem 4 (see Part II) in terms of the eigendecay of the kernel matrix \(\mathbf{K}_D\). If \(D\) is infinite or very large, we can use the operator spectrum of \(k(x, x^\prime)\), which likewise decays rapidly. For the kernels of interest here, asymptotic expressions for the operator eigenvalues are available. The key step of the proof is to show the existence of discretizations \(D_T \subset D\), dense in the limit, for which the tail sums \(B(T_\ast) / n_T\) in Theorem 4 are close to the corresponding operator tail sums \(B_k(T_\ast)\).
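
The three decay regimes can also be eyeballed numerically. A minimal sketch (the lengthscale, dimension, and sample size are illustrative choices) comparing the empirical spectra \(\hat{\lambda}_t / n\) of the three kernels on a uniform sample from \([0, 1]\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell, s5 = 400, 0.2, np.sqrt(5)  # Matern with nu = 5/2
X = rng.uniform(size=(n, 1))
r = np.abs(X - X.T)

kernels = {
    "linear": X @ X.T,                                    # finite spectrum
    "squared exponential": np.exp(-r**2 / (2 * ell**2)),  # exponential decay
    "matern (nu = 5/2)": (1 + s5*r/ell + 5*r**2 / (3*ell**2))
                         * np.exp(-s5 * r / ell),         # power-law decay
}
for name, K in kernels.items():
    lam = np.sort(np.linalg.eigvalsh(K))[::-1] / n
    print(f"{name:>20}: {np.round(lam[:6], 4)}")
```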

Reference

  • Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.