In this blog post, I share my solutions to the exercises of MacKay (2003), Chapter 27, page 342. Disclaimer: I do not guarantee the validity of each answer; all of the answers are my own. However, I am open to corrections: if you spot any flaws, feel free to contact me via email at arijonamarshal@gmail.com or marshal.arijona01@ui.ac.id. All of the notation follows the textbook. For further details, please refer to https://www.inference.org.uk/itprnn/book.pdf.

Problem 1

A photon counter is pointed at a remote star for one minute, in order to infer the rate of photons arriving at the counter per minute, $\lambda$. Assuming the number of photons collected $r$ has a Poisson distribution with mean $\lambda$,

$$p(r|\lambda) = \exp(-\lambda)\frac{\lambda^r}{r!},$$

and assuming the improper prior $p(\lambda) = 1/\lambda$, make Laplace approximations to the posterior distribution

(a) over $\lambda$

(b) over $\log\lambda$. [Note the improper prior transforms to $p(\log\lambda) = \text{constant}$.]

Problem 1a

First, let us compute the unnormalized posterior distribution $p^*(\lambda|r)$ and its log, respectively:

$$p^*(\lambda|r) = p(r|\lambda)\,p(\lambda) = \exp(-\lambda)\frac{\lambda^{r-1}}{r!}$$

$$\log p^*(\lambda|r) = -\lambda + (r-1)\log\lambda - \log r!$$

Recall that we need the mode of the distribution in order to perform the Laplace approximation. We can compute the mode of $p^*(\lambda|r)$ via maximum a posteriori (MAP). MAP requires us to find $\lambda_{\text{MAP}}$ such that:

$$\lambda_{\text{MAP}} = \arg\max_{\lambda}\, \log p^*(\lambda|r)$$

We can find $\lambda_{\text{MAP}}$ by differentiating $\log p^*(\lambda|r)$, setting the derivative to zero, and finally solving the equation for $\lambda$:

$$\frac{d\log p^*(\lambda|r)}{d\lambda} = -1 + \frac{r-1}{\lambda} = 0 \;\Longrightarrow\; \lambda_{\text{MAP}} = r-1$$

Now we can obtain the magnitude $c$ of the second derivative at $\lambda = \lambda_{\text{MAP}}$:

$$c = \left|\frac{d^2\log p^*(\lambda|r)}{d\lambda^2}\right|_{\lambda=\lambda_{\text{MAP}}} = \left|-\frac{r-1}{\lambda^2}\right|_{\lambda=r-1} = \frac{1}{r-1}$$

We are ready to construct our approximate distribution. First, let's construct the unnormalized approximation $q^*(\lambda)$:

$$q^*(\lambda) = p^*(\lambda_{\text{MAP}}|r)\exp\left(-\frac{c}{2}(\lambda-\lambda_{\text{MAP}})^2\right) = p^*(\lambda_{\text{MAP}}|r)\exp\left(-\frac{1}{2(r-1)}\big(\lambda-(r-1)\big)^2\right)$$

Our normalization factor $Z_q$ can be written as:

$$Z_q = p^*(\lambda_{\text{MAP}}|r)\sqrt{\frac{2\pi}{c}} = p^*(\lambda_{\text{MAP}}|r)\sqrt{2\pi(r-1)}$$

Therefore, we obtain $q(\lambda) = \frac{1}{\sqrt{2\pi(r-1)}}\exp\left(-\frac{1}{2(r-1)}\big(\lambda-(r-1)\big)^2\right) = \mathcal{N}(\lambda;\, r-1,\, r-1)$, a Gaussian with mean $r-1$ and variance $r-1$.
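As a quick numerical sanity check (my own sketch, not part of the book's solution), we can compare this Gaussian against the exact posterior, which for the prior $1/\lambda$ is a Gamma($r$, 1) density. The count $r = 10$ below is an arbitrary assumption:

```python
import numpy as np
from scipy import stats

r = 10  # assumed photon count, for illustration only
lam = np.linspace(0.1, 30, 500)

# Exact posterior p(lambda|r) ~ exp(-lambda) * lambda^(r-1): a Gamma(r, 1) density
exact = stats.gamma.pdf(lam, a=r, scale=1.0)
# Laplace approximation: N(r-1, r-1)
laplace = stats.norm.pdf(lam, loc=r - 1, scale=np.sqrt(r - 1))

print(np.max(np.abs(exact - laplace)))  # the discrepancy shrinks as r grows
```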

Problem 1b

From the description, $\lambda$ is transformed through the function $u(\lambda) = \log\lambda$; write $u = \log\lambda$, so $\lambda = e^u$. Under a change of variables, a density transforms as $p(u) = p(\lambda)\left|\frac{d\lambda}{du}\right| = p(\lambda)\,\lambda$, so the improper prior $p(\lambda) = 1/\lambda$ becomes the constant $p(u) = 1$. Using this rule, we obtain our unnormalized posterior $p^*(u|r)$ as follows:

$$p^*(u|r) = p(u)\,p(r|\lambda=e^u) \propto \exp(-\lambda)\frac{\lambda^{r}}{r!}, \qquad \lambda = e^u$$

Observe that our unnormalized transformed posterior is simply proportional to the likelihood, since the transformed prior is a constant. Now, taking the log of $p^*(u|r)$ gives us:

$$\log p^*(u|r) = -\lambda + r\log\lambda - \log r! = -e^u + ru - \log r!$$

Now, let's derive $u_{\text{MAP}}$:

$$\frac{d\log p^*(u|r)}{du} = -e^u + r = 0 \;\Longrightarrow\; u_{\text{MAP}} = \log r$$

We then evaluate the second derivative $c$ at $u = u_{\text{MAP}}$:

$$c = \left|\frac{d^2\log p^*(u|r)}{du^2}\right|_{u=u_{\text{MAP}}} = e^{\log r} = r$$

We are ready to construct our approximate distribution. First, let's construct the unnormalized approximation $q^*(u)$:

$$q^*(u) = p^*(u_{\text{MAP}}|r)\exp\left(-\frac{c}{2}(u-u_{\text{MAP}})^2\right) = p^*(u_{\text{MAP}}|r)\exp\left(-\frac{r}{2}(u-\log r)^2\right)$$

Our normalization factor $Z_q$ can be written as:

$$Z_q = p^*(u_{\text{MAP}}|r)\sqrt{\frac{2\pi}{c}} = p^*(u_{\text{MAP}}|r)\sqrt{\frac{2\pi}{r}}$$

Therefore, we obtain $q(u) = \sqrt{\frac{r}{2\pi}}\exp\left(-\frac{r}{2}(u-\log r)^2\right) = \mathcal{N}\left(u;\, \log r,\, \frac{1}{r}\right)$, a Gaussian with mean $\log r$ and variance $1/r$.
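The same sanity check works here (again my own sketch, with the same assumed $r = 10$): the exact posterior over $u$ is $p(u|r) = \exp(-e^u + ru)/\Gamma(r)$, and the Laplace fit in $\log\lambda$ is noticeably better than the one in $\lambda$:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

r = 10  # assumed photon count, for illustration only
u = np.linspace(np.log(r) - 2, np.log(r) + 2, 500)

# Exact posterior over u = log(lambda): exp(-e^u + r*u) / Gamma(r)
exact = np.exp(-np.exp(u) + r * u - gammaln(r))
# Laplace approximation: N(log r, 1/r)
laplace = stats.norm.pdf(u, loc=np.log(r), scale=1.0 / np.sqrt(r))

print(np.max(np.abs(exact - laplace)))  # smaller than the error in (a)
```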

Problem 2

Use Laplace's method to approximate the integral

$$Z(u_1, u_2) = \int_{-\infty}^{\infty} f(a)^{u_1}\big(1-f(a)\big)^{u_2}\, da,$$

where $f(a) = 1/(1+e^{-a})$ and $u_1, u_2$ are positive. Check the accuracy of the approximation against the exact answer (23.29, p. 316) for $(u_1, u_2) = (1/2, 1/2)$ and $(u_1, u_2) = (1, 1)$. Measure the error ($\log Z_p - \log Z_q$) in bits.

Answer:

Let us define $p^*(a) = f(a)^{u_1}(1-f(a))^{u_2}$. Therefore, we have $\log p^*(a) = u_1\log f(a) + u_2\log(1-f(a))$. The next task is to determine the mode $a_0$ of $p^*(a)$. Setting the derivative of $\log p^*(a)$ to 0 and solving for $a$, we obtain:

$$\frac{d\log p^*(a)}{da} = u_1\frac{f'(a)}{f(a)} - u_2\frac{f'(a)}{1-f(a)} = u_1\big(1-f(a)\big) - u_2 f(a) = 0 \;\Longrightarrow\; a_0 = \log\frac{u_1}{u_2}$$

Here $f'(a)$ refers to $\frac{df(a)}{da}$. The second equality comes from the fact that $f'(a) = f(a)(1-f(a))$. Next, let us determine the second derivative:

$$\frac{d^2\log p^*(a)}{da^2} = -(u_1+u_2)f'(a) = -(u_1+u_2)f(a)\big(1-f(a)\big)$$

Now we evaluate $\frac{d^2\log p^*(a)}{da^2}$ at $a = a_0$. Note that $f(a_0) = \frac{u_1}{u_1+u_2}$ and $1-f(a_0) = \frac{u_2}{u_1+u_2}$. Therefore:

$$c = \left|\frac{d^2\log p^*(a)}{da^2}\right|_{a=a_0} = (u_1+u_2)\,\frac{u_1}{u_1+u_2}\,\frac{u_2}{u_1+u_2} = \frac{u_1 u_2}{u_1+u_2}$$

The normalizing constant can be approximated by:

$$Z_p \approx Z_q = p^*(a_0)\sqrt{\frac{2\pi}{c}} = \left(\frac{u_1}{u_1+u_2}\right)^{u_1}\left(\frac{u_2}{u_1+u_2}\right)^{u_2}\sqrt{\frac{2\pi(u_1+u_2)}{u_1 u_2}}$$

It's time for evaluation! For $(u_1, u_2) = (1, 1)$, we have:

$$Z_q(u_1=1, u_2=1) = \frac{1}{2}\times\frac{1}{2}\times 2\sqrt{\pi} = \frac{\sqrt{\pi}}{2} \approx 0.886$$

$$Z_p(u_1=1, u_2=1) = \frac{\Gamma(1)\Gamma(1)}{\Gamma(1+1)} = 1$$

with the error:

$$\log_2\frac{Z_p}{Z_q} = \log_2\frac{2}{\sqrt{\pi}} \approx 0.17 \text{ bits}$$

For the error measurement we use the base-2 logarithm; everywhere else we use the natural logarithm.

Meanwhile, for $(u_1, u_2) = \left(\frac{1}{2}, \frac{1}{2}\right)$, we have:

$$Z_q(u_1=1/2, u_2=1/2) = \left(\frac{1}{2}\right)^{1/2}\times\left(\frac{1}{2}\right)^{1/2}\times\sqrt{8\pi} = \sqrt{2\pi} \approx 2.507$$

$$Z_p(u_1=1/2, u_2=1/2) = \frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1)} = (1.772\ldots)^2 = \pi \approx 3.142$$

with the error: $\log_2\frac{Z_p}{Z_q} = \log_2\frac{\pi}{\sqrt{2\pi}} \approx 0.33$ bits.
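The whole comparison is easy to script. Here is a short Python check (my own sketch) that evaluates $Z_q$ from the formula above, the exact Beta-integral answer $\Gamma(u_1)\Gamma(u_2)/\Gamma(u_1+u_2)$, a brute-force numerical integral as a safeguard, and the error in bits:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def z_laplace(u1, u2):
    # Laplace estimate: p*(a0) * sqrt(2*pi / c)
    c = u1 * u2 / (u1 + u2)
    p_mode = (u1 / (u1 + u2)) ** u1 * (u2 / (u1 + u2)) ** u2
    return p_mode * np.sqrt(2.0 * np.pi / c)

def z_exact(u1, u2):
    # Beta integral, MacKay eq. (23.29)
    return gamma(u1) * gamma(u2) / gamma(u1 + u2)

def z_numeric(u1, u2):
    # Direct numerical integration of f(a)^u1 * (1 - f(a))^u2
    f = lambda a: 1.0 / (1.0 + np.exp(-a))
    val, _ = quad(lambda a: f(a) ** u1 * (1.0 - f(a)) ** u2, -50, 50)
    return val

for u1, u2 in [(1.0, 1.0), (0.5, 0.5)]:
    zq, zp = z_laplace(u1, u2), z_exact(u1, u2)
    print(f"u=({u1},{u2}): Zq={zq:.4f}, Zp={zp:.4f} "
          f"(numeric: {z_numeric(u1, u2):.4f}), error={np.log2(zp / zq):.3f} bits")
```

Running this reproduces the numbers above: roughly 0.17 bits for $(1,1)$ and 0.33 bits for $(1/2,1/2)$.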

Problem 3

Linear regression. $N$ datapoints $\{x_n, t_n\}_{n=1}^N$ are generated by the experimenter choosing each $x_n$, then the world delivering a noisy version of the linear function

$$y(x) = w_0 + w_1 x, \qquad t_n \sim \mathcal{N}\big(y(x_n), \sigma_\nu^2\big)$$

Assuming Gaussian priors on $w_0$ and $w_1$, make the Laplace approximation to the posterior distribution of $w_0$ and $w_1$ (which is exact, in fact) and obtain the predictive distribution for the next datapoint $t_{N+1}$, given $x_{N+1}$.

Answer:

Suppose that $\mathbf{w} = (w_0, w_1)^T$ with a standard Gaussian prior on each weight. Then our prior is:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}; \mathbf{0}, I) \propto \exp\left(-\frac{1}{2}\mathbf{w}^T\mathbf{w}\right), \qquad \log p(\mathbf{w}) = -\frac{1}{2}\mathbf{w}^T\mathbf{w} + \text{const}$$

Suppose that we have training data $(\mathbf{x}, \mathbf{t}) = \{x_n, t_n\}_{n=1}^N$. Since the observations are independent Gaussians that depend on $x_n$, $w_0$, and $w_1$, the likelihood is:

$$p(\mathbf{t}|\mathbf{x}, w_0, w_1) \propto \prod_{n=1}^N \exp\left(-\frac{1}{2\sigma_\nu^2}\big(t_n - y(x_n)\big)^2\right) = \exp\left(-\frac{1}{2\sigma_\nu^2}\sum_{n=1}^N \big(t_n - y(x_n)\big)^2\right)$$

$$\log p(\mathbf{t}|\mathbf{x}, w_0, w_1) = -\frac{1}{2\sigma_\nu^2}\sum_{n=1}^N \big(t_n - y(x_n)\big)^2 + \text{const}$$

Having the prior and the likelihood, we are ready to compute the log of the unnormalized posterior. We only need to apply Bayes' theorem to do so:

$$\log p^*(\mathbf{w}|\mathbf{x},\mathbf{t}) = \log p(\mathbf{t}|\mathbf{x},\mathbf{w}) + \log p(\mathbf{w}) = -\frac{1}{2\sigma_\nu^2}\sum_{n=1}^N \big(t_n - y(x_n)\big)^2 - \frac{1}{2}\mathbf{w}^T\mathbf{w}$$

Here $p^*(\mathbf{w}|\mathbf{x},\mathbf{t})$ denotes the unnormalized posterior. The rest of the solution is carrying out the Laplace approximation. The first step is to obtain $\mathbf{w}_{\text{MAP}}$ by setting the gradient of $\log p^*(\mathbf{w}|\mathbf{x},\mathbf{t})$ to zero:

$$\frac{d\log p^*(\mathbf{w}|\mathbf{x},\mathbf{t})}{d\mathbf{w}} = \begin{bmatrix} \frac{1}{\sigma_\nu^2}\sum_{n=1}^N \big[t_n - (w_0 + w_1 x_n)\big] - w_0 \\[6pt] \frac{1}{\sigma_\nu^2}\sum_{n=1}^N \big[t_n - (w_0 + w_1 x_n)\big]x_n - w_1 \end{bmatrix} = \mathbf{0}$$

Solving this $2\times 2$ linear system gives:

$$\mathbf{w}_{\text{MAP}} = \begin{bmatrix} w_0^{\text{MAP}} \\ w_1^{\text{MAP}} \end{bmatrix} = \frac{1}{\big(N + \sigma_\nu^2\big)\big(\sum_n x_n^2 + \sigma_\nu^2\big) - \big(\sum_n x_n\big)^2} \begin{bmatrix} \big(\sum_n x_n^2 + \sigma_\nu^2\big)\sum_n t_n - \sum_n x_n \sum_n x_n t_n \\[6pt] -\sum_n x_n \sum_n t_n + \big(N + \sigma_\nu^2\big)\sum_n x_n t_n \end{bmatrix}$$
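As a sketch (my own, on synthetic data; the noise level and true weights are assumptions for illustration), $\mathbf{w}_{\text{MAP}}$ can be computed by solving the equivalent $2\times 2$ normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_nu = 50, 0.5                                   # assumed sample size and noise
x = rng.uniform(-3, 3, size=N)
t = 1.0 + 2.0 * x + rng.normal(0.0, sigma_nu, size=N)   # true w = (1, 2)

# Normal equations: (N + s2) w0 + (sum x) w1        = sum t
#                   (sum x) w0 + (sum x^2 + s2) w1  = sum x*t
s2 = sigma_nu ** 2
A = np.array([[N + s2, x.sum()],
              [x.sum(), (x ** 2).sum() + s2]])
b = np.array([t.sum(), (x * t).sum()])
w_map = np.linalg.solve(A, b)
print(w_map)  # close to the true (1, 2)
```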

The next step is to obtain the Hessian matrix $H$ of $-\log p^*(\mathbf{w}|\mathbf{x},\mathbf{t})$:

$$H = -\begin{bmatrix} \frac{\partial^2 \log p^*}{\partial w_0^2} & \frac{\partial^2 \log p^*}{\partial w_0\,\partial w_1} \\[6pt] \frac{\partial^2 \log p^*}{\partial w_1\,\partial w_0} & \frac{\partial^2 \log p^*}{\partial w_1^2} \end{bmatrix} = \begin{bmatrix} \frac{N}{\sigma_\nu^2} + 1 & \frac{\sum_n x_n}{\sigma_\nu^2} \\[6pt] \frac{\sum_n x_n}{\sigma_\nu^2} & \frac{\sum_n x_n^2}{\sigma_\nu^2} + 1 \end{bmatrix}$$

Now we can approximate $p(\mathbf{w}|\mathbf{x},\mathbf{t})$ with a distribution $q(\mathbf{w})$, starting from the unnormalized form:

$$q^*(\mathbf{w}) = p^*(\mathbf{w}_{\text{MAP}})\exp\left[-\frac{1}{2}(\mathbf{w}-\mathbf{w}_{\text{MAP}})^T H (\mathbf{w}-\mathbf{w}_{\text{MAP}})\right]$$

Subsequently, we obtain the normalization factor Zq

$$Z_q = p^*(\mathbf{w}_{\text{MAP}})\sqrt{\frac{(2\pi)^K}{\det H}}$$

where $K$ denotes the dimensionality of $\mathbf{w}$ (here $K = 2$). Finally, our approximate distribution is a normal distribution:

$$q(\mathbf{w}) = \frac{q^*(\mathbf{w})}{Z_q} = \sqrt{\frac{\det H}{(2\pi)^2}}\exp\left[-\frac{1}{2}(\mathbf{w}-\mathbf{w}_{\text{MAP}})^T H (\mathbf{w}-\mathbf{w}_{\text{MAP}})\right] = \mathcal{N}(\mathbf{w};\, \mathbf{w}_{\text{MAP}},\, H^{-1})$$
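Continuing the synthetic-data sketch (setup repeated so the snippet runs on its own): the Hessian above is exactly $A/\sigma_\nu^2$ for the matrix $A$ of the normal equations, and the posterior covariance is $H^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_nu = 50, 0.5                                   # same assumed setup as before
x = rng.uniform(-3, 3, size=N)
t = 1.0 + 2.0 * x + rng.normal(0.0, sigma_nu, size=N)
s2 = sigma_nu ** 2

A = np.array([[N + s2, x.sum()],
              [x.sum(), (x ** 2).sum() + s2]])
w_map = np.linalg.solve(A, np.array([t.sum(), (x * t).sum()]))

H = A / s2                           # the Hessian equals A / sigma_nu^2
cov = np.linalg.inv(H)               # posterior covariance H^{-1}
print(w_map, np.sqrt(np.diag(cov)))  # MAP estimate and posterior std. devs.
```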

Given a new data point $x_{N+1}$, we compute the approximate predictive distribution:

$$p(t_{N+1}|x_{N+1}) = \int p(t_{N+1}|x_{N+1}, \mathbf{w})\, q(\mathbf{w})\, d\mathbf{w}$$

In general, computing this integral is intractable. (In this particular case both factors are Gaussian and $y$ is linear in $\mathbf{w}$, so it can in fact be evaluated in closed form: writing $\boldsymbol{\phi} = (1, x_{N+1})^T$, the predictive distribution is $\mathcal{N}\big(t_{N+1};\, \boldsymbol{\phi}^T\mathbf{w}_{\text{MAP}},\, \sigma_\nu^2 + \boldsymbol{\phi}^T H^{-1}\boldsymbol{\phi}\big)$.) More generally, we can utilize Monte Carlo estimation to approximate it:

$$p(t_{N+1}|x_{N+1}) \approx \frac{1}{M}\sum_{i=1}^M p(t_{N+1}|x_{N+1}, \mathbf{w}_i), \qquad \mathbf{w}_i \sim q(\mathbf{w})$$

where $M$ is the number of samples of $\mathbf{w}$. The more samples we use, the more accurate the estimate becomes.
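Here is a sketch of that Monte Carlo estimate, compared against the closed-form Gaussian predictive mentioned above (the MAP estimate, covariance, and query point below are stand-in values for illustration; in practice they come from the previous snippet):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
w_map = np.array([1.0, 2.0])                   # stand-in for w_MAP
cov = np.array([[0.006, -0.001],
                [-0.001, 0.002]])              # stand-in for H^{-1}
sigma_nu, x_new, M = 0.5, 1.5, 10_000          # assumed noise, query point, sample count

ws = rng.multivariate_normal(w_map, cov, size=M)  # w_i ~ q(w)
y_samples = ws[:, 0] + ws[:, 1] * x_new           # y(x_new) for each sample

# Monte Carlo: p(t|x_new) ~ (1/M) * sum_i N(t; y_i, sigma_nu^2)
t_grid = np.linspace(2.0, 6.0, 200)
mc = stats.norm.pdf(t_grid[:, None], y_samples[None, :], sigma_nu).mean(axis=1)

# Closed form: N(t; phi^T w_MAP, sigma_nu^2 + phi^T H^{-1} phi), phi = (1, x_new)
phi = np.array([1.0, x_new])
exact = stats.norm.pdf(t_grid, phi @ w_map,
                       np.sqrt(sigma_nu ** 2 + phi @ cov @ phi))
print(np.max(np.abs(mc - exact)))  # small for large M
```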