5.7. Variance Reduction Techniques#
Although the MC error always converges as \(N^{-1 / 2}\), where \(N\) is the number of sampling points (or, equivalently, the computer time), there are various ways to reduce the coefficient. These techniques go under the name of variance reduction. Consider the integral:

\[
I = \int dx\, f(x)
\]

( \(x\) could be a multidimensional variable.) Define the efficiency of a MC calculation as:

\[
\xi = \frac{1}{\sigma^{2}\, t}
\]
where \(t\) is the time to sample a given point and \(\sigma^{2}\) is the variance of a single sample. We want to devise an algorithm that maximizes \(\xi\). Usually one does this by reducing the variance. Of course, you can also get a faster computer or write more efficient code; since it is the product \(\sigma^{2} t\) that matters, doing both is even better. The final statistical error \(\epsilon\) is related to the efficiency as

\[
\epsilon = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{\sigma^{2} t}{T}} = \frac{1}{\sqrt{\xi T}},
\]

where \(T = n t\) is the total computer time of the run.
5.7.1. Importance sampling#
The most common MC variance reduction method is to optimize the sampling of \(x\). Suppose we sample \(x\) from \(p(x)\), where \(p(x)\) is arbitrary for the moment. Then we rewrite the integral as:

\[
I = \int dx\, p(x)\, \frac{f(x)}{p(x)} = \left\langle \frac{f(x)}{p(x)} \right\rangle_{p},
\]

where the subscript indicates the sampling function. The estimator is \(f(x) / p(x)\). The variance is:

\[
\sigma^{2}_{p} = \int dx\, p(x) \left[ \frac{f(x)}{p(x)} - I \right]^{2} = \int dx\, \frac{f(x)^{2}}{p(x)} - I^{2}.
\]
The mean value of the estimator is independent of \(p(x)\), but the variance is not. To simplify things, let us assume that the CPU time per sample ( \(t\) ) is independent of \(p(x)\). Then we can vary \(p(x)\) to minimize \(\sigma^{2}\). Remember that \(p(x)\) must remain a probability distribution. The simplest way to maintain the constraints is to substitute \(p(x)=q^{2}(x) / \int d x q^{2}(x)\), where \(q(x)\) is a completely arbitrary real function. After solving \(\delta \sigma^{2} / \delta q(x)=0\) we find the optimal sampling distribution:

\[
p^{*}(x) = \frac{|f(x)|}{\int dx'\, |f(x')|}.
\]
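For completeness, here is the functional-derivative step, which the notes do not spell out. With \(p = q^{2}/\int dz\, q^{2}(z)\),

\[
\sigma^{2}[q] = \int dx\, \frac{f(x)^{2}}{p(x)} - I^{2}
 = \left( \int dz\, q^{2}(z) \right) \left( \int dy\, \frac{f(y)^{2}}{q^{2}(y)} \right) - I^{2},
\]

so that

\[
\frac{\delta \sigma^{2}}{\delta q(x)}
 = 2\, q(x) \int dy\, \frac{f(y)^{2}}{q^{2}(y)}
 - \frac{2\, f(x)^{2}}{q^{3}(x)} \int dz\, q^{2}(z) = 0
 \quad \Longrightarrow \quad q^{2}(x) \propto |f(x)|,
\]

which gives the \(p^{*}(x)\) above.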
The optimal estimator is:

\[
\frac{f(x)}{p^{*}(x)} = \operatorname{sign}\!\big(f(x)\big) \int dx'\, |f(x')|.
\]
Our estimator is \(\pm\) constant; it only changes when the sign of \(f(x)\) changes! We see that if \(f\) is entirely positive or negative, the optimal variance is zero. This is an example of a zero variance principle. As the sampling approaches the optimal sampling, the variance approaches zero. But notice that we can only obtain the optimal by performing the integral since we need \(I\) to be able to sample \(p^{*}(x)\). This is typical. Optimal sampling requires first solving the problem! However, by using some improved sampling, incorporating some of the features of \(p^{*}\), one can drastically reduce the variance.
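As a quick check of the zero-variance claim (added here; it follows directly from the variance formula above),

\[
\sigma^{2}_{p^{*}} = \int dx\, \frac{f(x)^{2}}{p^{*}(x)} - I^{2}
 = \left( \int dx\, |f(x)| \right)^{2} - I^{2},
\]

which indeed vanishes when \(f(x)\) does not change sign, since then \(\int dx\, |f(x)| = |I|\). For mixed signs it is exactly the minimum variance quoted below.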
Example: let \(f(x)=\exp \left(-x^{2} / 2\right) /\left(1+x^{2}\right)\) be integrated over \((-\infty, \infty)\). Let us take as an importance sampling function \(p(x)=(2 \pi a)^{-1 / 2} \exp \left(-x^{2} /(2 a)\right)\), with \(a\) a free parameter. Numerically we find the minimum variance occurs at \(a \approx 0.6\), giving a reduction of a factor of 9 in the variance over the “obvious” value \(a=1\). The estimator is proportional to \(\exp \left(-\tfrac{1}{2}(1-1 / a)\, x^{2}\right) /\left(1+x^{2}\right)\) and hence blows up at large \(x\) if \(a<1\). But the integrand of the variance goes as \(\exp \left(-\tfrac{1}{2}(2-1 / a)\, x^{2}\right)\): the variance exists only for \(a>1 / 2\)! The figures show what happens when you go ahead and importance sample in such a way that the variance does not exist.
The code that produced these numbers is:
nsamp=10000 ! number of samples
a=0.6 ! sampling parameter
! some constants
sa=sqrt(a)
pi=3.14159265
s2pia=sqrt(2*pi*a)
exa=(1/a-1)/2
! zero mean and variance
xm=0
x2=0
do isamp=1, nsamp ! loop over samples
chi=sqrt(-2*log(ranf()))*cos(pi*ranf()) ! normally distributed random number
x=sa*chi ! scale by a
e=s2pia*exp(x*x*exa)/(x*x+1) !estimator
xm=xm+e ! add to mean
x2=x2+e**2 ! add to square
enddo
write (*,*) 'mean =',xm/nsamp,
+' error=',sqrt((x2/nsamp-(xm/nsamp)**2)/(nsamp-1))
end
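As a check on the output (this value is not given in the notes): the exact integral is \(\int_{-\infty}^{\infty} e^{-x^{2}/2}/(1+x^{2})\, dx = \pi\, e^{1/2} \operatorname{erfc}(1/\sqrt{2}) \approx 1.6435\), so with \(10^{4}\) samples the printed mean should be consistent with this value within its quoted error.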
Important rules for setting up importance sampling:

- Once you have chosen \(p(x)\), prove that the variance exists. Look at the limits at large \(x\) and at places where \(p(x)\) or \(f(x)\) vanishes, and look at the distribution of sampled values to make sure it is Gaussian.
- In general, do not undersample: make sure that \(f(x) / p(x)\) never gets too big. In the above example, increasing \(a\) is safe; decreasing \(a\) is dangerous.
- The value, within the computed errors, should be independent of \(p(x)\). Make sure you pass this test; otherwise you have a bug (see the sketch after this list).
- Once you are close to the minimum variance, just run longer; you can waste a lot of time trying to optimize further.
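The following sketch (not part of the original notes; it assumes the standard Fortran `random_number` intrinsic in place of `ranf()`) illustrates the consistency test of the third rule on the example above. The means should agree within their quoted errors for the values of \(a\) above \(1/2\); the \(a=0.4\) run is included to show what happens when the variance does not exist, so its quoted error cannot be trusted.

```fortran
! Sketch (not from the notes): rerun the estimator of the example for several
! values of the sampling width a and check that the means agree within error bars.
program check_sampling
  implicit none
  integer, parameter :: nsamp = 100000
  real, parameter :: pi = 3.14159265
  real :: avals(4) = (/ 0.4, 0.6, 1.0, 2.0 /)   ! a=0.4 has infinite variance: its error bar is meaningless
  real :: a, u1, u2, chi, x, e, xm, x2
  integer :: ia, isamp
  do ia = 1, size(avals)
     a = avals(ia)
     xm = 0
     x2 = 0
     do isamp = 1, nsamp
        call random_number(u1)
        call random_number(u2)
        chi = sqrt(-2*log(1-u1))*cos(pi*u2)          ! normal deviate (1-u1 avoids log(0))
        x = sqrt(a)*chi                              ! sample x from the Gaussian p(x) of variance a
        e = sqrt(2*pi*a)*exp(x*x*(1/a-1)/2)/(x*x+1)  ! estimator f(x)/p(x)
        xm = xm + e
        x2 = x2 + e**2
     enddo
     write (*,*) 'a=', a, ' mean=', xm/nsamp, &
          ' error=', sqrt((x2/nsamp-(xm/nsamp)**2)/(nsamp-1))
  enddo
end program check_sampling
```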
For functions with both positive and negative pieces, optimal sampling does not lead to zero variance. Let us define the fraction of time that one samples positive (or negative) values of \(f\) as:

\[
n_{+} = \frac{1}{Z} \int_{f(x) > 0} dx\, |f(x)|,
\]

with a similar definition for \(n_{-}\) and \(Z=\int d x|f(x)|\). Clearly \(n_{+} + n_{-} = 1\) and \(I=Z\left(n_{+}-n_{-}\right)\). The quantity in parentheses is known as the “average sign”, \(\langle s \rangle = n_{+} - n_{-}\). It can be shown by rearranging terms that the minimum variance is:

\[
\sigma^{2}_{*} = Z^{2} - I^{2} = Z^{2}\left[ 1 - \langle s \rangle^{2} \right].
\]

Hence we want to make sure that our integral is either mostly positive or mostly negative; otherwise there is a considerable loss of efficiency. The relative error is:

\[
\frac{\epsilon}{|I|} = \frac{\sigma_{*}}{|I| \sqrt{n}} = \frac{\sqrt{1 - \langle s \rangle^{2}}}{\langle s \rangle \sqrt{n}} \approx \frac{1}{\langle s \rangle \sqrt{n}},
\]
which becomes very large if the average sign is close to zero. This is why we say: Monte Carlo is good at adding but cannot subtract. One might think that one could simply shift the function by adding a constant, but this rarely works very well, particularly in high dimensions or for integrals over infinite domains.
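To make the cost explicit (a worked consequence of the formula above, not spelled out in the notes): reaching a target relative error \(\epsilon_{\mathrm{rel}}\) requires roughly

\[
n \approx \frac{1 - \langle s \rangle^{2}}{\langle s \rangle^{2}\, \epsilon_{\mathrm{rel}}^{2}} \approx \frac{1}{\langle s \rangle^{2}\, \epsilon_{\mathrm{rel}}^{2}}
\]

samples, so for example \(\langle s \rangle = 0.01\) at one percent relative error already costs \(n \approx 10^{8}\) samples, even with optimal sampling.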
5.7.2. Antithetic Variates#
The idea here is to use negative correlation to reduce the variance. That is, we couple a point where \(f\) is large ( \(x\) such that \(f(x) / p(x)>I\) ) with a point where \(f\) is small. Suppose \(f\) is generally positive for \(x>0\) and generally negative for \(x<0\). Then we write the integral as

\[
I = \int dx\, f(x) = \int dx\, \tfrac{1}{2}\left[ f(x) + f(-x) \right] = \int dx\, \hat{f}(x).
\]
We couple the evaluation of \(x\) with \(-x\). With the assumed properties, the variance of \(\hat{f}(x)=\) \(1 / 2[f(x)+f(-x)]\) may be much less than the variance of \(f(x)\). Also \(\hat{f}\) may be more suitable for the application of importance sampling.
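As an illustration (a sketch with an assumed test case, not taken from the notes), the program below integrates \(f(x) = p(x)\,(x^{3} + x + \cos x)\) with \(p(x)\) a unit Gaussian, so the exact answer is \(e^{-1/2}\). It compares the plain estimator \(f(x)/p(x)\) with the antithetic estimator \(\hat{f}(x)/p(x) = \tfrac{1}{2}\left[f(x)+f(-x)\right]/p(x)\), which couples each sampled \(x\) with \(-x\):

```fortran
! Sketch (not from the notes): antithetic variates for a mostly odd integrand.
! f(x) = p(x)*(x**3 + x + cos(x)) with p(x) a unit Gaussian; exact answer exp(-0.5).
program antithetic
  implicit none
  integer, parameter :: nsamp = 100000
  real, parameter :: pi = 3.14159265
  real :: u1, u2, x, e1, e2, m1, s1, m2, s2
  integer :: isamp
  m1 = 0; s1 = 0; m2 = 0; s2 = 0
  do isamp = 1, nsamp
     call random_number(u1)
     call random_number(u2)
     x = sqrt(-2*log(1-u1))*cos(pi*u2)    ! normal deviate, sampled from p(x)
     e1 = g(x)                            ! plain estimator  f(x)/p(x)
     e2 = 0.5*(g(x) + g(-x))              ! antithetic estimator, couples x with -x
     m1 = m1 + e1;  s1 = s1 + e1**2
     m2 = m2 + e2;  s2 = s2 + e2**2
  enddo
  write (*,*) 'plain      mean=', m1/nsamp, &
       ' error=', sqrt((s1/nsamp-(m1/nsamp)**2)/(nsamp-1))
  write (*,*) 'antithetic mean=', m2/nsamp, &
       ' error=', sqrt((s2/nsamp-(m2/nsamp)**2)/(nsamp-1))
contains
  real function g(y)            ! g = f/p: mostly odd, plus a small even part
    real, intent(in) :: y
    g = y**3 + y + cos(y)
  end function g
end program antithetic
```

Most of this \(f\) is odd and cancels in \(\hat{f}\), so for the same number of samples the antithetic error bar should come out roughly an order of magnitude smaller than the plain one.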
5.7.4. References#
Hammersley and Handscomb, *Monte Carlo Methods*, pp. 50-75.
Kalos and Whitlock, *Monte Carlo Methods*, pp. 89-114.