\[ \newcommand{\ind}{\mathbb{1}} \]
In this post, I am trying to come up with a simple introduction to NS (nested sampling), through the lens of SMC samplers. It should be interesting to readers who are familiar with the latter but not with the former.
This post is inspired by this paper by Salomone et al., which has just been accepted in JRSSB. Congrats to the authors!
Setup
Consider a model with parameter \(x\), prior \(p(x)\), and likelihood \(L(x)\). The posterior is then \[\pi(x) = \frac{p(x)L(x)}{Z},\quad Z = \int p(x)L(x)dx.\] (The Bayesian interpretation is not essential. More generally, \(p\) could be a proposal distribution, \(\pi\) a target distribution, and \(L\) a function proportional to \(\pi/p\).)
Let’s now introduce the following family of distributions: \[\pi_\lambda(x) = \frac{p(x) \ind\{ L(x) > \lambda\}}{Z(\lambda)},\quad Z(\lambda) = \mathbb{P}_\mathrm{prior}(L(X) > \lambda).\] In words, \(\pi_\lambda\) is the prior truncated to the region \(\{x: L(x) > \lambda\}\), and the normalising constant \(Z(\lambda)\) is the prior probability that \(L(X)>\lambda\).
If we introduce a sequence \(0=\lambda_{0} < \lambda_1 < \dots < \lambda_{T+1} = \infty\), we can use an SMC sampler to recursively approximate \(\pi_{\lambda_t}\) and its normalising constant \(Z(\lambda_t)\). Note that the particle weights at each time will be 0 or 1 in this particular SMC sampler, since: \[ \frac{\pi_{\lambda_t}(x)}{\pi_{\lambda_{t-1}}(x)} \propto \ind\{L(x) > \lambda_t\}. \]
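To make things concrete, here is a minimal sketch (in Python/numpy, on a toy one-dimensional model of my own: a \(N(0,1)\) prior and a Gaussian likelihood) of one step of such a sampler: reweight with the 0/1 indicator, resample the survivors, then apply random-walk Metropolis steps that leave the truncated prior \(\pi_{\lambda_t}\) invariant. This is only an illustration of the general idea, not the exact algorithm of Salomone et al.; I work with thresholds on \(\log L\), which is equivalent since \(\log\) is monotone.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(x, m=2.0, s=0.5):
    """Toy likelihood: log of L(x) = N(m; x, s^2)."""
    return -0.5 * ((x - m) / s) ** 2 - 0.5 * np.log(2 * np.pi * s ** 2)

def ns_smc_step(x, lam, n_mcmc=5, rw_scale=0.5):
    """One NS-SMC-style step: 0/1 reweighting w.r.t. the new (log-)threshold lam,
    resampling among the survivors, then random-walk Metropolis moves that leave
    the N(0,1) prior truncated to {log L(x) > lam} invariant."""
    N = x.shape[0]
    alive = log_likelihood(x) > lam            # 0/1 weights: 1{L(x) > lambda_t}
    idx = rng.choice(np.flatnonzero(alive), size=N, replace=True)  # resample survivors
    x = x[idx].copy()
    for _ in range(n_mcmc):
        prop = x + rw_scale * rng.standard_normal(N)
        log_prior_ratio = -0.5 * (prop ** 2 - x ** 2)   # N(0,1) prior
        accept = (np.log(rng.uniform(size=N)) < log_prior_ratio) \
                 & (log_likelihood(prop) > lam)          # reject moves outside {L > lam}
        x[accept] = prop[accept]
    return x

x = rng.standard_normal(1000)                  # initial sample from the prior
x = ns_smc_step(x, lam=np.quantile(log_likelihood(x), 0.5))
```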
The simpler NS identity
Now comes the identity. Let \(\gamma(\varphi) = \int p L \varphi\) for an arbitrary function \(\varphi\) of \(x\). Then: \[\begin{align*} \gamma(\varphi) & = \int p L \varphi \\ & = \sum_{t=0}^T \int p L \varphi \ind\{ \lambda_{t} < L \leq \lambda_{t+1} \} \\ & = \sum_{t=0}^T Z(\lambda_t) \pi_{\lambda_t}\left(L \varphi \times \ind\{ L \leq \lambda_{t+1}\}\right) \end{align*} \] where the last line simply uses the definition of \(\pi_{\lambda_t}\), i.e. \(p(x)\ind\{L(x) > \lambda_t\} = Z(\lambda_t)\pi_{\lambda_t}(x)\).
Thus, if we implement an SMC sampler that tracks the sequence \(\pi_{\lambda_t}\), we will be able to approximate all the quantities above, and therefore, through this identity, the marginal likelihood \(Z=\gamma(1)\) and posterior expectations \(\pi(\varphi) = \gamma(\varphi) / \gamma(1)\).
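In particular, for \(\varphi=1\), the identity reduces to \[ Z = \gamma(1) = \sum_{t=0}^T Z(\lambda_t)\, \pi_{\lambda_t}\left( L \times \ind\{ L \leq \lambda_{t+1}\} \right), \] which is the quantity that the estimators below will target.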
Choosing the \(\lambda_t\)’s
In practice, we need to choose the \(\lambda_t\)’s. As in tempering, it seems reasonable to set them automatically, in such a way that the ESS (effective sample size) equals \(\alpha N\), for some \(\alpha\in(0, 1)\). Because the weight function is \(0/1\), this amounts to taking \(\lambda_t\) to be the \(\alpha-\)upper quantile of the likelihood values \(L(X_t^n)\), where the \(X_t^n\)’s are the \(N\) particles sampled at time \(t\) by our SMC sampler. This is what Salomone et al. recommend. In this case, we can replace \(Z(\lambda_t)\) in the identity above by \(\alpha^t\), at least for \(t<T\).
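In code, this adaptive choice is essentially a single quantile computation (a sketch; `logL` is my name for the vector of log-likelihood values of the current particles):

```python
import numpy as np

def next_threshold(logL, alpha=0.5):
    """alpha-upper quantile of the (log-)likelihood values: a fraction alpha of
    the particles satisfies log L(X) > lambda, so that the 0/1 weights give an
    ESS of (roughly) alpha * N."""
    return np.quantile(logL, 1.0 - alpha)
```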
The corresponding estimate will be something like: \[ \widehat{\gamma}(\varphi) = \sum_{t=0}^{T-1} \alpha^t \left[ \frac{1}{N} \sum_{n=1}^N \varphi(X_{t+1}^n) L(X_{t+1}^n) \ind\{ L(X_{t+1}^n) \leq \lambda_{t+1}\} \right] + \dots \] where I omitted the \(T-\)th term (its expression is slightly different, since \(Z(\lambda_T)\neq \alpha^T\)), and I used the fact that the unweighted sample \(X_{t+1}^{1:N}\) generated at the beginning of iteration \(t+1\) targets \(\pi_{\lambda_t}\).
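Here is a sketch of this estimator in Python (omitting the \(T\)-th term, exactly as in the formula above). The inputs `phi_hist`, `logL_hist` and `lam_hist` are names of my own; they are meant to store, for each iteration \(t\), the values \(\varphi(X_{t+1}^n)\), \(\log L(X_{t+1}^n)\) and the threshold \(\lambda_{t+1}\).

```python
import numpy as np

def gamma_hat(phi_hist, logL_hist, lam_hist, alpha):
    """Estimate gamma(phi) = int p * L * phi from the output of an adaptive
    NS-SMC run, omitting the final (t = T) term as in the formula above.
    phi_hist[t][n]  : phi(X_{t+1}^n), numpy array, particles targeting pi_{lambda_t}
    logL_hist[t][n] : log L(X_{t+1}^n), numpy array
    lam_hist[t]     : threshold lambda_{t+1}, on the log scale
    """
    total = 0.0
    for t, (phi, logL, lam) in enumerate(zip(phi_hist, logL_hist, lam_hist)):
        inside = logL <= lam                   # indicator 1{L(X) <= lambda_{t+1}}
        total += alpha ** t * np.mean(phi * np.exp(logL) * inside)
    return total
```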
Vanilla NS as a particular waste-free SMC sampler
Now assume that, in your adaptive NS-SMC sampler, you set \(\alpha=(1 - \frac 1 N)\) (or equivalently, \(\lambda_{t+1}=\min_n L(X_{t+1}^n)\)); that is, you discard only one particle, the one with the smallest likelihood. In other words, you decide to move as slowly as possible up the likelihood function.
If you were to resample the \(N-1\) surviving particles and apply \(k\) MCMC steps to each of them, you would get a very expensive sampler: increasing \(N\) increases both the cost of a single iteration and the total number of iterations (since it makes \(\alpha\) larger).
A cheaper alternative is to pick one of the \(N-1\) surviving particles at random, apply one MCMC step to it, and take the output as your new \(N-\)th particle. You then get an algorithm which is very close to the original NS algorithm. In particular, your estimate of \(Z=\gamma(1)\) becomes: \[ \widehat{\gamma}(1) = \sum_{t=0}^{T-1} \frac{1}{N} \left(1- \frac 1 N\right)^t L_{t+1} + \dots \] with \(L_{t+1} = \min_n L(X_{t+1}^n)\). (The original NS estimate has \((1-1/N)^t/N\) replaced by \(\exp(-t/N) - \exp(-(t+1)/N)\), which should be numerically very close for large \(N\).)
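For what it is worth, here is a minimal sketch of this vanilla-NS-as-SMC recipe, on the same toy Gaussian model as above (entirely my own illustration, not the original algorithm of Skilling or of Salomone et al.; the exact evidence is available in closed form, which makes the output easy to check):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_likelihood(x, m=2.0, s=0.5):
    """Toy likelihood L(x) = N(m; x, s^2); under a N(0,1) prior the exact
    evidence is N(m; 0, 1 + s^2), i.e. about 0.072 for m=2, s=0.5."""
    return -0.5 * ((x - m) / s) ** 2 - 0.5 * np.log(2 * np.pi * s ** 2)

def vanilla_ns(N=500, T=5000, n_mcmc=10):
    x = rng.standard_normal(N)                 # initial sample from the N(0,1) prior
    logL = log_likelihood(x)
    Z_hat = 0.0
    for t in range(T):
        worst = np.argmin(logL)                # particle with the smallest likelihood
        lam = logL[worst]                      # lambda_{t+1} = min_n L(X_{t+1}^n)
        Z_hat += (1 / N) * (1 - 1 / N) ** t * np.exp(lam)
        # replace it by an MCMC-moved copy of a randomly chosen survivor
        j = rng.integers(N)
        while j == worst:
            j = rng.integers(N)
        xnew = x[j]
        scale = np.std(x)                      # crude adaptive random-walk scale
        for _ in range(n_mcmc):                # RW Metropolis on the prior truncated to {L > lam}
            prop = xnew + scale * rng.standard_normal()
            if log_likelihood(prop) > lam and \
               np.log(rng.uniform()) < -0.5 * (prop ** 2 - xnew ** 2):
                xnew = prop
        x[worst], logL[worst] = xnew, log_likelihood(xnew)
    return Z_hat                               # the final "+ ..." term is omitted, as above

print(vanilla_ns())                            # should be roughly 0.072 (the exact evidence)
```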
This idea of resampling \(N-1\) particles and moving only one of them is reminiscent of waste-free SMC. In waste-free SMC, you resample only \(M\) particles out of \(N\), with \(M<N\). Then, assuming \(M\) divides \(N\), i.e., \(N=M\times P\) for some \(P\geq 2\), you apply \((P-1)\) MCMC steps to each resampled particle, and gather the resulting \(M\times P\) states to form a new particle sample of size \(N\). What if \(M\) does not divide \(N\), i.e. \(N=M \times k +r\) with \(0<r<M\)? Then it makes sense to generate \(r\) MCMC chains of length \(k+1\), and \(M-r\) chains of length \(k\). This is what happens here, with \(M=N-1\), \(k=1\) and \(r=1\).
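As a small sanity check of this bookkeeping (my own illustration), here is the corresponding chain-length allocation:

```python
def chain_lengths(N, M):
    """Lengths of the M chains (starting state included) needed to produce
    N states in total: r chains of length k+1 and M-r chains of length k,
    where N = M*k + r."""
    k, r = divmod(N, M)
    lengths = [k + 1] * r + [k] * (M - r)
    assert sum(lengths) == N
    return lengths

print(chain_lengths(10, 3))     # [4, 3, 3]
print(chain_lengths(100, 99))   # NS case (M = N-1): one chain of length 2, i.e. a
                                # single MCMC step, plus 98 "chains" of length 1
```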
Why did I say we get a “simpler” identity?
The original derivation of NS by Skilling obtains essentially the same identity as above, but through more convoluted steps, which involve the CDF of the random variable \(L(X)\) for \(X\sim p\), its inverse, Beta distributions, and so on. I find the derivation above simpler (at least, again, if you are familiar with SMC samplers). Of course, in return, the justification you get for vanilla NS is a bit hand-wavy (but for NS-SMC, it is perfectly solid).
Should I care about NS?
There are two sub-questions:
NS vs NS-SMC: Salomone et al. give numerical evidence (and arguments) suggesting that NS-SMC outperforms NS.
NS-SMC vs tempering SMC or other SMC schemes: Salomone et al. also give numerical evidence suggesting NS-SMC is competitive with tempering SMC, which is intriguing (and in line with independent numerical experiments I did).
I will elaborate on these two points in my next post (coming soon). In the meantime, feel free to have a look at the aforementioned paper; it is well worth a read.