With summer approaching (what a perfect time for trips and bouldering), I’m teaching myself about the regret bound of GP-UCB optimization. To enhance my understanding of the topic, I’ve decided to write a blog series. There will be three parts in total (see Part II and Part III). In this part, we will dive into the regret bound of GP-UCB optimization in the context of finite and compact sets. For more details, you can refer to the paper.

Introduction

Problem statement. Let us start with the problem statement. The problem is sequentially optimizing a black-box function \(f : D \rightarrow \mathbb{R}\). In each round \(t\), we query a point \(x_t\) and evaluate the function value, perturbed by noise \(\epsilon_t\), i.e., \(y_t = f(x_t) + \epsilon_t\). Typically, \(\epsilon_t\) is drawn from a normal distribution \(\mathcal{N}(0, \sigma^2)\). We are interested in maximizing \(\sum_{t = 1}^T f(x_t)\) as well as obtaining \(x^\ast = \mathrm{argmax}_{x \in D} f(x)\). Here, \(T\) denotes the total number of rounds.

Regret. Common performance metrics for GP-UCB optimization include instantaneous regret and cumulative regret. For a particular round \(t\), we define the instantaneous regret \(r_t = f(x^\ast) - f(x_t)\). The cumulative regret \(R_T\) is the sum of instantaneous regrets: \(R_T = \sum_{t = 1}^T r_t\). A desirable asymptotic property is to be no-regret: \(\lim_{T \rightarrow \infty} R_T / T = 0\).

Gaussian process (GP). We model the function \(f\) as a sample from a GP: a collection of dependent random variables, one for each input \(x\), any finite subset of which is jointly multivariate normal. A GP \(GP(\mu(x), k(x, x^\prime))\) is specified by its mean function \(\mu(x) = \mathbb{E}[f(x)]\) and its covariance (or kernel) function \(k(x, x^\prime) = \mathbb{E}[(f(x) - \mu(x)) (f(x^\prime) - \mu(x^\prime))]\). Typically, we assume \(\mu(x) = 0\) for all \(x \in D\).
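Given noisy observations \(\mathbf{y}_t = (y_1, \dots, y_t)\) at points \(\{x_1, \dots, x_t\}\), the posterior over \(f\) is again a GP, whose mean and variance are given by the standard GP regression formulas:

\[\begin{eqnarray} && \mu_t(x) = \mathbf{k}_t(x)^\top (\mathbf{K}_t + \sigma^2 I)^{-1} \mathbf{y}_t \nonumber \\ && \sigma_t^2(x) = k(x, x) - \mathbf{k}_t(x)^\top (\mathbf{K}_t + \sigma^2 I)^{-1} \mathbf{k}_t(x) \nonumber \end{eqnarray}\]

where \(\mathbf{k}_t(x) = [k(x, x_i)]_{i \leq t}\) and \(\mathbf{K}_t = [k(x_i, x_j)]_{i, j \leq t}\). These are the posterior quantities \(\mu_{t - 1}\) and \(\sigma_{t - 1}\) (computed from the first \(t - 1\) observations) that appear in the acquisition function below.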

Upper Confidence Bound (UCB) acquisition function. At each round \(t\), we query \(x_t\) by maximizing the acquisition function (AF). Here, we consider maximizing UCB defined as:

\[\begin{equation} x_t = \mathrm{argmax}_{x \in D} \; \mu_{t - 1}(x) + \beta_t^{1 / 2} \sigma_{t - 1}(x) \end{equation}\]

where \(\beta_t\), \(\mu_{t - 1}\), and \(\sigma_{t - 1}\) denote the exploration-exploitation parameter, the posterior mean, and the posterior standard deviation, respectively. UCB prioritizes points \(x\) with high uncertainty (large \(\sigma_{t - 1}(x)\)) that, at the same time, are expected to achieve high value (large \(\mu_{t - 1}(x)\)). The parameter \(\beta_t\) negotiates these two objectives.
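To make the selection rule concrete, here is a minimal NumPy sketch of one GP-UCB round over a finite candidate set. The squared exponential kernel, its lengthscale, the candidate grid, and the value of \(\beta_t\) are illustrative choices, not prescribed by the theory:

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=0.2):
    """Squared exponential kernel k(x, x') between 1-d point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_query, x_obs, y_obs, noise_var=0.01):
    """Posterior mean and standard deviation at x_query given noisy observations."""
    K = sq_exp_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_star = sq_exp_kernel(x_obs, x_query)
    mean = k_star.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)  # k(x, x) = 1 here
    return mean, np.sqrt(np.clip(var, 0.0, None))

def ucb_select(candidates, x_obs, y_obs, beta_t):
    """One GP-UCB round: maximize mu + sqrt(beta_t) * sigma over the candidates."""
    mu, sigma = gp_posterior(candidates, x_obs, y_obs)
    return candidates[np.argmax(mu + np.sqrt(beta_t) * sigma)]

# Toy usage on f(x) = sin(3x) over a finite decision set D (placeholder values).
rng = np.random.default_rng(0)
D = np.linspace(0.0, 2.0, 200)
x_obs = rng.uniform(0.0, 2.0, size=5)
y_obs = np.sin(3 * x_obs) + 0.1 * rng.standard_normal(5)
x_next = ucb_select(D, x_obs, y_obs, beta_t=4.0)
```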


Figure: Illustration of GP-UCB optimization on the sine function, running for 50 rounds with 10 initial points sampled using Sobol initialization. Credit: Anastasiia Makarova.

We now provide the cumulative regret bound for GP-UCB in two different settings:

  1. \(f \sim GP(0, k(x, x^\prime))\) for finite decision set \(D\).
  2. \(f \sim GP(0, k(x, x^\prime))\) for general compact decision set \(D\).

Cumulative Regret Bound on Finite Decision Set

This analysis relies on the maximum information gain \(\gamma_T\) after \(T\) rounds, defined as:

\[\begin{eqnarray} && \gamma_T = \underset{A \subset D: \vert A \vert = T}{\max} I(\mathbf{y}_A; \mathbf{f}_A) \nonumber \\ && I(\mathbf{y}_A; \mathbf{f}_A) = H(\mathbf{y}_A) - H(\mathbf{y}_A \vert \mathbf{f}_A) \nonumber \end{eqnarray}\]

with \(\mathbf{f}_A = [f(x)]_{x \in A}\), \(\mathbf{y}_A = \mathbf{f}_A + \varepsilon_A\), \(\varepsilon_A \sim \mathcal{N}(0, I \sigma^2)\). Here, \(I(. ; .)\) and \(H(.)\) denote the information gain and the entropy, respectively. In our case, we have \(I(\mathbf{y}_A; \mathbf{f}_A) = 1 / 2 \log \vert I + \sigma^{-2} \mathbf{K}_A \vert\), where \(\mathbf{K}_A = [k(x, x^\prime)]_{x, x^\prime \in A}\).
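As a quick illustration, the following minimal sketch evaluates \(I(\mathbf{y}_A; \mathbf{f}_A) = \frac{1}{2} \log \vert I + \sigma^{-2} \mathbf{K}_A \vert\) for a given set \(A\) (the kernel, lengthscale, noise level, and points are placeholder choices). Finding \(\gamma_T\) requires maximizing this quantity over all size-\(T\) subsets, which is intractable in general but, as discussed in the paper, can be approximated greedily thanks to submodularity.

```python
import numpy as np

def info_gain(K_A, noise_var):
    """I(y_A; f_A) = 1/2 * log det(I + K_A / noise_var)."""
    T = K_A.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(T) + K_A / noise_var)
    return 0.5 * logdet

# Placeholder kernel matrix K_A for a set A of 4 points under a squared exponential kernel.
x_A = np.array([0.1, 0.4, 0.7, 0.9])
K_A = np.exp(-0.5 * ((x_A[:, None] - x_A[None, :]) / 0.2) ** 2)
print(info_gain(K_A, noise_var=0.01))
```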

Theorem 1 Let \(\delta \in (0, 1)\) and \(\beta_t = 2 \log (\vert D \vert \, t^2 \, \pi^2 / (6 \delta))\). Running GP-UCB with \(\beta_t\) for a sample \(f\) of a GP with mean function zero and covariance function \(k(x, x^\prime)\), we obtain a regret bound of \(\mathcal{O}^\ast(\sqrt{T \, \gamma_T \, \log \vert D \vert})\) with high probability, where \(\mathcal{O}^\ast\) suppresses logarithmic factors. Precisely,

\[\begin{equation} \mathbb{P}(R_T \leq \sqrt{C_1 \, T \, \beta_T \, \gamma_T} \quad \forall T \geq 1) \geq 1 - \delta \nonumber \end{equation}\]

where \(C_1 = 8 / \log (1 + \sigma^{-2})\).


Proof:

The strategy is to show that \(\vert f(x) - \mu_{t - 1}(x) \vert \leq \beta_t^{1/2} \sigma_{t - 1}(x)\) holds for all \(x \in D\) and all \(t \in \mathbb{N}\) with high probability, and then to bound each instantaneous regret by the posterior standard deviation at the queried point.

Lemma 5.1 Pick \(\delta \in (0, 1)\) and set \(\beta_t = 2 \log (\vert D \vert \, \pi_t / \delta)\), where \(\sum_{t \geq 1} \pi_t^{-1} = 1, \pi_t > 0\). Then,

\[\begin{equation} \vert f(x) - \mu_{t - 1}(x) \vert \leq \beta_t^{1 / 2} \sigma_{t - 1}(x) \quad \forall x \in D \; \forall t \geq 1 \nonumber \end{equation}\]

holds with probability \(\geq 1 - \delta\).

Proof:

Fix \(t \geq 1\) and \(x \in D\). Conditioned on \(\mathbf{y}_{t - 1} = (y_1, \dots, y_{t-1})\), the points \(\{x_1, \dots, x_{t-1} \}\) are deterministic, and \(f(x) \sim \mathcal{N}(\mu_{t-1}(x), \sigma^2_{t-1}(x))\). If \(r \sim \mathcal{N}(0, 1)\), then

\[\begin{eqnarray} \mathbb{P}(r > c) &=& e^{-c^2 / 2} (2 \pi)^{- 1 / 2} \int_{c}^{\infty} e^{-(r - c)^2 / 2 - c (r - c)} \, dr \nonumber \\ &\leq& e^{-c^2 / 2} \mathbb{P}(r > 0) = \frac{1}{2} e^{- c^2 / 2} \nonumber \end{eqnarray}\]

for \(c > 0\), since \(e^{-c (r - c)} \leq 1\) for \(r \geq c\). Hence, \(\mathbb{P}(\vert f(x) - \mu_{t - 1}(x) \vert > \beta_t^{1/2} \sigma_{t - 1}(x)) \leq e^{- \beta_t / 2}\), using \(r = (f(x) - \mu_{t - 1}(x)) / \sigma_{t - 1}(x)\) and \(c = \beta_t^{1/2}\). Applying the union bound,

\[\begin{equation} \vert f(x) - \mu_{t - 1}(x) \vert \leq \beta_t^{1/2} \sigma_{t - 1}(x)\nonumber \end{equation}\]

holds with probability \(\geq 1 - \vert D \vert e^{- \beta_t / 2}\). Choosing \(\vert D \vert e^{- \beta_t / 2} = \delta / \pi_t\) and applying the union bound over \(t \in \mathbb{N}\), the statement holds with probability \(\geq 1 - \delta\), since \(\sum_{t \geq 1} \pi_t^{-1} = 1\). To recover the choice of Theorem 1, we take \(\pi_t = \pi^2 t^2 / 6\).
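To spell out the resulting exploration parameter: solving \(\vert D \vert e^{-\beta_t / 2} = \delta / \pi_t\) for \(\beta_t\) and substituting \(\pi_t = \pi^2 t^2 / 6\) gives

\[\begin{equation} \beta_t = 2 \log (\vert D \vert \, \pi_t / \delta) = 2 \log (\vert D \vert \, t^2 \, \pi^2 / (6 \delta)) \nonumber \end{equation}\]

which is exactly the value used in Theorem 1.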

Lemma 5.2 Fix \(t \geq 1\). If \(\vert f(x) - \mu_{t - 1}(x) \vert \leq \beta_t^{1 / 2} \sigma_{t - 1}(x)\) for all \(x \in D\), then the regret \(r_t\) is bounded by \(2 \beta_t^{1/2} \sigma_{t - 1}(x_t)\).

Proof:

By definition of \(x_t\), we have \(\mu_{t-1}(x_t) + \beta_t^{1/2} \sigma_{t-1}(x_t) \geq \mu_{t-1}(x^\ast) + \beta_t^{1/2} \sigma_{t-1}(x^\ast) \geq f(x^\ast)\), where the second inequality uses the assumption with \(x = x^\ast\). Therefore,

\[\begin{equation} r_t = f(x^\ast) - f(x_t) \leq \mu_{t-1}(x_t) + \beta_t^{1/2} \sigma_{t-1}(x_t) - f(x_t) \leq 2 \beta_t^{1/2} \sigma_{t-1}(x_t) \nonumber \end{equation}\]

Lemma 5.3 The information gain for the points selected can be expressed in terms of the predictive variances. If \(\mathbf{f}_T = (f(x_1), \dots, f(x_T)) \in \mathbb{R}^T\) and \(\mathbf{y}_T\) are the corresponding noisy observations:

\[\begin{equation} I(\mathbf{y}_T; \mathbf{f}_T) = \frac{1}{2} \sum^T_{t = 1} \log (1 + \sigma^{-2} \, \sigma^2_{t-1}(x_t)) \nonumber \end{equation}\]

Proof:

Recall that \(I(\mathbf{y}_T; \mathbf{f}_T) = H(\mathbf{y}_T) - 1/2 \log \vert 2 \pi e \sigma^2 I \vert\). Now, \(H(\mathbf{y}_T) = H(\mathbf{y}_{T-1}) + H(y_T \vert \mathbf{y}_{T - 1}) = H(\mathbf{y}_{T-1}) + \frac{1}{2} \log (2 \pi e(\sigma^2 + \sigma_{T - 1}^2(x_T)))\), since conditioned on \(\mathbf{y}_{T-1}\) we have \(y_T \sim \mathcal{N}(\mu_{T-1}(x_T), \sigma^2 + \sigma^2_{T-1}(x_T))\). The result follows by induction.
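For readers who want to sanity-check this identity numerically, the sketch below compares the telescoping sum with the closed-form log-determinant on a handful of arbitrary points (the kernel, lengthscale, noise level, and points are placeholder values):

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Squared exponential kernel matrix between 1-d point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=6)   # stand-ins for the queried points x_1, ..., x_T
sigma2 = 0.01                       # noise variance sigma^2
K = rbf(x, x)

# Left-hand side of Lemma 5.3: telescoping sum of posterior variances.
lhs = 0.0
for t in range(len(x)):
    if t == 0:
        var = K[t, t]                                           # prior variance k(x_1, x_1)
    else:
        k_prev = rbf(x[:t], x[t:t + 1]).ravel()                 # k_{t-1}(x_t)
        K_prev = K[:t, :t] + sigma2 * np.eye(t)
        var = K[t, t] - k_prev @ np.linalg.solve(K_prev, k_prev)
    lhs += 0.5 * np.log(1.0 + var / sigma2)

# Right-hand side: I(y_T; f_T) = 1/2 * log det(I + K_T / sigma^2).
rhs = 0.5 * np.linalg.slogdet(np.eye(len(x)) + K / sigma2)[1]
print(lhs, rhs)   # the two values agree up to floating-point error
```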

Lemma 5.4 Pick \(\delta \in (0, 1)\) and let \(\beta_t\) be defined as in Lemma 5.1. Then, the following holds with probability \(\geq 1 - \delta\):

\[\begin{equation} \sum_{t = 1}^T r_t^2 \leq \beta_T \, C_1 \, I(\mathbf{y}_T; \mathbf{f}_T) \leq C_1 \beta_T \gamma_T \quad \forall T \geq 1, \nonumber \end{equation}\]

where \(C_1 \triangleq 8 / \log(1 + \sigma^{-2}) \geq 8 \sigma^2\).

Proof:

By Lemma 5.1 and Lemma 5.2, we have that \(r_t^2 \leq 4 \beta_t \, \sigma_{t - 1}^2(x_t)\) for all \(t \geq 1\) with probability \(\geq 1 - \delta\). Since \(\beta_t\) is non-decreasing,

\[\begin{eqnarray} 4 \beta_t \, \sigma_{t - 1}^2(x_t) &\leq& 4 \beta_T \, \sigma^2 ( \sigma^{-2} \sigma_{t - 1}^2(x_t) ) \nonumber \\ &\leq& 4 \beta_T \, \sigma^2 C_2 \log (1 + \sigma^{-2} \sigma_{t - 1}^2(x_t)) \nonumber \end{eqnarray}\]

with \(C_2 = \sigma^{-2} / \log (1 + \sigma^{-2}) \geq 1\), since \(s^2 \leq C_2 \log (1 + s^2)\) for \(0 \leq s^2 \leq \sigma^{-2}\), and \(\sigma^{-2} \sigma_{t - 1}^2(x_t) \leq \sigma^{-2} k(x_t, x_t) \leq \sigma^{-2}\). Noting that \(C_1 = 8 \sigma^2 C_2\), the result follows by summing over \(t\) and plugging in the representation of Lemma 5.3.

Theorem 1 is a consequence of Lemma 5.4, since \(R_T^2 = (\sum_{t = 1}^T r_t)^2 \leq T \sum_{t = 1}^T r_t^2\) by the Cauchy-Schwarz inequality.
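Written out, on the event of Lemma 5.4 we have

\[\begin{equation} R_T = \sum_{t = 1}^T r_t \leq \sqrt{T \sum_{t = 1}^T r_t^2} \leq \sqrt{C_1 \, T \, \beta_T \, \gamma_T} \quad \forall T \geq 1 \nonumber \end{equation}\]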

The key insight here is that with high probability over samples from the GP, the cumulative regret is bounded in terms of the maximum information gain.

Cumulative Regret Bound on Compact Decision Set

We can generalize the result to any compact and convex \(D \subset \mathbb{R}^d\) under mild smoothness assumptions on the kernel \(k\).

Theorem 2 Let \(D \subset [0, r]^d\) be compact and convex, \(d \in \mathbb{N}, r > 0\). Suppose that the kernel \(k(x, x^\prime)\) satisfies the following high probability bound on the derivatives of GP sample paths \(f\): for some constants \(a, b > 0\),

\[\begin{equation} \mathbb{P}(\underset{x \in D}{\sup} \vert \partial f / \partial x_j \vert > L) \leq a e^{-(L / b)^2}, j = 1, \dots, d \nonumber \end{equation}\]

Pick \(\delta \in (0, 1)\), and define

\[\begin{equation} \beta_t = 2 \log (t^2 2 \pi^2 / (3 \delta)) + 2d \log \left(t^2 d b r \sqrt{\log (4 d a / \delta)} \right) \nonumber \end{equation}\]

Running GP-UCB with \(\beta_t\) for a sample \(f\) of a GP with mean function zero and covariance function \(k(x, x^\prime)\), we obtain a regret bound of \(\mathcal{O}^\ast(\sqrt{d T \gamma_T})\) with high probability. Precisely, with \(C_1 = 8 / \log ( 1 + \sigma^{-2} )\) we have

\[\begin{equation} \mathbb{P}(R_T \leq \sqrt{C_1 \, T \, \beta_T \, \gamma_T} + 2 \quad \forall T \geq 1) \geq 1 - \delta \nonumber \end{equation}\]

Proof:

Lemma 5.5 Pick \(\delta \in (0, 1)\) and set \(\beta_t = 2 \log ( \pi_t / \delta)\), where \(\sum_{t \geq 1} \pi_t^{-1} = 1, \pi_t > 0\). Then,

\[\begin{equation} \vert f(x_t) - \mu_{t - 1}(x_t) \vert \leq \beta_t^{1 / 2} \sigma_{t - 1}(x_t) \quad \forall t \geq 1 \nonumber \end{equation}\]

holds with probability \(\geq 1 - \delta\).

Proof:

Fix \(t \geq 1\). Conditioned on \(\mathbf{y}_{t - 1} = (y_1, \dots, y_{t-1})\), the points \(\{x_1, \dots, x_{t} \}\) are deterministic, and \(f(x_t) \sim \mathcal{N}(\mu_{t-1}(x_t), \sigma^2_{t-1}(x_t))\). As before, \(\mathbb{P}(\vert f(x_t) - \mu_{t - 1}(x_t) \vert > \beta_t^{1/2} \sigma_{t - 1}(x_t)) \leq e^{- \beta_t / 2}\). Choosing \(e^{- \beta_t / 2} = \delta / \pi_t\) and applying the union bound over \(t \in \mathbb{N}\), the statement holds.

For the purpose of the analysis, we introduce a finite discretization \(D_t \subset D\) at each round \(t\). We need \(D_t\) to obtain a confidence bound involving \(x^\ast\), which Lemma 5.5 alone does not provide.

Lemma 5.6 Pick \(\delta \in (0, 1)\) and set \(\beta_t = 2 \log ( \vert D_t \vert \pi_t / \delta)\), where \(\sum_{t \geq 1} \pi_t^{-1} = 1, \pi_t > 0\). Then,

\[\begin{equation} \vert f(x) - \mu_{t - 1}(x) \vert \leq \beta_t^{1 / 2} \sigma_{t - 1}(x) \quad \forall x \in D_t \; \forall t \geq 1 \nonumber \end{equation}\]

holds with probability \(\geq 1 - \delta\).

Proof:

The proof is identical to that of Lemma 5.1, except that the union bound is now taken over \(D_t\) at each round \(t\).

By the assumption of Theorem 2 and the union bound over the \(d\) coordinates, we have

\[\begin{equation} \mathbb{P}(\forall j, \forall x \in D, \vert \partial f / \partial x_j \vert < L) \geq 1 - d a e^{-(L / b)^2} \nonumber \end{equation}\]

which implies that with probability \(\geq 1 - d a e^{-(L / b)^2}\), we have

\[\begin{equation}\label{eq:lipschitz} \forall x, x^\prime \in D, \quad \vert f(x) - f(x^\prime) \vert \leq L \Vert x - x^\prime \Vert_1 \end{equation}\]

Let us choose a discretization \(D_t\) of size \((\tau_t)^d\) so that for all \(x \in D\)

\[\begin{equation} \Vert x - [x]_t \Vert_1 \leq r d / \tau_t \nonumber \end{equation}\]

where \([x]_t\) denotes the closest point in \(D_t\) to \(x\). A sufficient discretization has each coordinate with \(\tau_t\) uniformly spaced points.
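As a concrete sketch of this construction, the following builds such a uniform grid on \([0, r]^d\) with \(\tau_t\) points per coordinate and returns the closest grid point \([x]_t\) in \(\ell_1\) distance. Note that \(D_t\) is purely an analysis device and never has to be formed by the algorithm; the values below are illustrative placeholders:

```python
import itertools
import numpy as np

def uniform_grid(r, d, tau):
    """Discretization D_t: tau uniformly spaced points per coordinate on [0, r]^d."""
    axis = np.linspace(0.0, r, tau)
    return np.array(list(itertools.product(axis, repeat=d)))   # shape (tau**d, d)

def closest_grid_point(D_t, x):
    """[x]_t: the point of D_t closest to x in l1 distance."""
    return D_t[np.argmin(np.abs(D_t - x).sum(axis=1))]

D_t = uniform_grid(r=1.0, d=2, tau=5)
x_star = np.array([0.33, 0.71])
print(closest_grid_point(D_t, x_star))   # within l1 distance r * d / tau of x_star
```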

Lemma 5.7 Pick \(\delta \in (0, 1)\) and set \(\beta_t = 2 \log (2 \pi_t / \delta) + 4 d \log(d t b r \sqrt{\log(2 d a / \delta)})\), where \(\sum_{t \geq 1} \pi_t^{-1} = 1, \pi_t > 0\). Let \(\tau_t = d t^2 b r \sqrt{\log(2 d a / \delta)}\), and let \([x^\ast]_t\) denote the closest point in \(D_t\) to \(x^\ast\). Then,

\[\begin{equation} \vert f(x^\ast) - \mu_{t - 1}([x^\ast]_t) \vert \leq \beta_t^{1 / 2} \sigma_{t - 1}([x^\ast]_t) + \frac{1}{t^2} \quad \forall t \geq 1 \nonumber \end{equation}\]

holds with probability \(\geq 1 - \delta\).

Proof: Setting \(L = b \sqrt{\log(2 d a / \delta)}\) in \eqref{eq:lipschitz} makes the failure probability \(d a e^{-(L / b)^2}\) equal to \(\delta / 2\). Hence, with probability \(\geq 1 - \delta / 2\),

\[\begin{equation} \forall x, x^\prime \in D, \quad \vert f(x) - f(x^\prime) \vert \leq b \sqrt{\log(2 d a / \delta)} \Vert x - x^\prime \Vert_1 \nonumber \end{equation}\]

Hence,

\[\begin{equation} \forall x \in D, \quad \vert f(x) - f([x]_t) \vert \leq r d b \sqrt{\log(2 d a / \delta)} / \tau_t \nonumber \end{equation}\]

By choosing \(\tau_t = d t^2 b r \sqrt{\log(2 d a / \delta)}\), we have that

\[\begin{equation} \forall x \in D, \quad \vert f(x) - f([x]_t) \vert \leq \frac{1}{t^2} \nonumber \end{equation}\]

This implies that \(\vert D_t \vert = (d t^2 b r \sqrt{\log(2 d a / \delta)})^d\). Using \(\delta / 2\) in Lemma 5.6, we can apply the confidence bound to \([x^\ast]_t\) (as this lives in \(D_t\)) to obtain the result.

Lemma 5.8 Pick \(\delta \in (0, 1)\), and set \(\beta_t = 2 \log(4 \pi_t / \delta) + 4 d \log (dtbr \sqrt{\log (4 da / \delta)})\), where \(\sum_{t \geq 1} \pi_t^{-1} = 1, \pi_t > 0\). Then, with probability greater than \(1 - \delta\), for all \(t \in \mathbb{N}\), the regret is bounded as follows:

\[\begin{equation} r_t \leq 2 \beta_t^{1 / 2} \sigma_{t - 1}(x_t) + \frac{1}{t^2} \nonumber \end{equation}\]

Proof:

Use \(\delta / 2\) in both Lemma 5.5 and Lemma 5.7, so that both events hold simultaneously with probability \(\geq 1 - \delta\). By definition of \(x_t\), \(\mu_{t - 1}(x_t) + \beta_t^{1 / 2} \sigma_{t - 1}(x_t) \geq \mu_{t - 1}([x^\ast]_t) + \beta_t^{1 / 2} \sigma_{t - 1}([x^\ast]_t)\). By Lemma 5.7, we have that \(\mu_{t - 1}([x^\ast]_t) + \beta_t^{1 / 2} \sigma_{t - 1}([x^\ast]_t) + 1/t^2 \geq f(x^\ast)\), i.e., \(\mu_{t - 1}([x^\ast]_t) + \beta_t^{1 / 2} \sigma_{t - 1}([x^\ast]_t) \geq f(x^\ast) - 1/t^2\). Therefore,

\[\begin{eqnarray} r_t &=& f(x^\ast) - f(x_t) \nonumber \\ &\leq& \mu_{t - 1}([x^\ast]_t) + \beta_t^{1 / 2} \sigma_{t - 1}([x^\ast]_t) + 1/t^2 - f(x_t) \nonumber \\ &\leq& \mu_{t - 1}(x_t) + \beta_t^{1 / 2} \sigma_{t - 1}(x_t) + 1/t^2 - f(x_t) \nonumber \\ &\leq& 2\beta_t^{1 / 2} \sigma_{t - 1}(x_t) + 1/t^2 \nonumber \end{eqnarray}\]

where the last step uses Lemma 5.5.

As shown in the proof of Lemma 5.4, with probability \(\geq 1 - \delta\),

\[\begin{equation} \sum_{t = 1}^T 4 \beta_t \sigma^2_{t - 1}(x_t) \leq C_1 \beta_T \gamma_T \quad \forall T \geq 1, \nonumber \end{equation}\]

By the Cauchy-Schwarz inequality,

\[\begin{equation} \sum_{t = 1}^T 2 \beta_t^{1 / 2} \sigma_{t - 1}(x_t) \leq \sqrt{C_1 \beta_T T \gamma_T} \quad \forall T \geq 1, \nonumber \end{equation}\]

Hence,

\[\begin{equation} \sum_{t = 1}^T r_t \leq \sqrt{C_1 \beta_T T \gamma_T} + \pi^2/6 \quad \forall T \geq 1, \nonumber \end{equation}\]

since \(\sum_{t \geq 1} 1 / t^2 = \pi^2 / 6 \leq 2\). Theorem 2 now follows.

Kernels that satisfy the condition in the above theorem include the squared exponential kernel and sufficiently smooth Matérn kernels (\(\nu > 2\)).

In the next part, we will generalize the function \(f\) to be an arbitrary function from the reproducing kernel Hilbert space (RKHS) corresponding to the kernel \(k(x, x^\prime)\).

Reference

  • Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.