Stratified sampling is a *Monte Carlo* (MC) method to estimate an integral of form \[
I = \int_\mathcal{X} p(x) f(x) dx
\] via partitioning \(\mathcal{X}\) into smaller groups, called *strata*. It is a type of *quasi-Monte Carlo* (QMC) method as it introduces deterministic procedures into the MC estimation. Same as crude Monte Carlo the estimation is still unbiased; however, the variance of the estimator can be smaller than crude MC.

The crude Monte Carlo estimator of \(I\) is \[ \hat{I} = \frac{1}{N} \sum_{i=1}^N f(x^i) \] where \(x^i \sim p(x)\) and \(N\) is the number of MC samples.

This estimator is an *unbiased*^{[1]} estimator of the original target \(I\): \[
\begin{aligned}
\mathbb{E}[\hat{I}]
&= \mathbb{E} \left[\frac{1}{N} \sum_{i=1}^N f(x^i) \right] \\
&= \frac{1}{N} \sum_{i=1}^N \mathbb{E} [f(x^i)] \\
&= \frac{1}{N} \sum_{i=1}^N I \\
&= I
\end{aligned}.
\]

Now let's look at the variance of the estimator \(\mathbb{V}[\hat{I}]\). We first need to define \(\sigma^2 := \int_\mathcal{X} p(x) (f(x) - I)^2 dx\), which is the variance of \(f(x)\) under the distribution \(p(x)\). Now the variance of interest is \[ \begin{aligned} \mathbb{V}[\hat{I}] &= \mathbb{V} \left[\frac{1}{N} \sum_{i=1}^N f(x^i) \right] \\ &= \frac{1}{N^2} \sum_{i=1}^N \mathbb{V}[f(x^i)] \\ &= \frac{1}{N^2} \sum_{i=1}^N \sigma^2 \\ &= \frac{\sigma^2}{N} \\ \end{aligned}. \] As it can be seen, the variance drops with the order of \(O(\frac{1}{N})\) by using more samples.

I now use an area estimation example to demonstrate how the empirical estimate of the variance of the crude Monte Carlo estimator changes as the number of samples increases. The area \(S\) I am interested in estimating is illustrated in the figure below.^{[2]}

This is equivalent to computing the integral below \[ \begin{aligned} S &= \int_{[-1, 1]^2} \mathbb{1}(x \in S) dx \\ &= 4 \int_{[-1, 1]^2} \frac{1}{4} \mathbb{1}(x \in S) dx \\ &= 4 \int_{[-1, 1]^2} \mathcal{U}(x; [-1, 1]^2) \mathbb{1}(x \in S) dx \end{aligned} \] where \(\mathbb{1}\) is the indicator function and \(\mathcal{U}(x; A)\) is the uniform distribution over \(A\).

The crude MC estimator for \(S\) is then \[ \hat{S} = 4 \frac{1}{N} \sum_{i=1}^N \mathbb{1}(x^i \in S) \] where \(x^i \sim \mathcal{U}(x; [-1, 1]^2)\) and \(N\) is the MC samples.

I now visualize the variance of \(\hat{S}\) changes as \(N\) increases.

Note that both the x- and y-axis are in logarithm scale. As we can see, it matches our analysis results, which is \(O(\frac{1}{N})\).^{[3]}

Now, we look at how stratified sampling works and how the variance of the corresponding estimator changes depending on different stratifications. For the integral of interest \(I = \int_\mathcal{X} p(x) f(x) dx\), we first partition \(\mathcal{X}\) into \(H\) strata s.t. \(\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_H\). Then, as the simplest way of performing stratified sampling, we *uniformly*^{[4]} draw independent Monte Carlo samples for each stratum \(\mathcal{X}_h\), and the overall estimator is given as below \[
\hat{I}_\mathrm{STRAT} = \sum_{h=1}^H \frac{|\mathcal{X}_h|}{n_h}\sum_{i=1}^{n_h} f(x_h^i)
\] where \(\{x_h^i\}_{i=1}^{n_h}\) are samples that lie in \(\mathcal{X}_h\) and \(n_h\) is the number of MC samples in stratum \(h\).

The mean of \(f\) within each stratum \(h\) is \[ \begin{aligned} \mu_h &= \int_{\mathcal{X}_h} U(x; \mathcal{X}_h) f(x) dx \\ &= |\mathcal{X}_h|^{-1} \int_{\mathcal{X}_h} f(x) dx \end{aligned} \] and the corresponding variance is \[ \begin{aligned} \sigma_h^2 &= \int_{\mathcal{X}_h} U(x; \mathcal{X}_h) (f(x) - \mu_h)^2 dx \\ &= |\mathcal{X}_h|^{-1} \int_{\mathcal{X}_h} (f(x) - \mu_h)^2 dx \end{aligned}. \]

As it is mentioned in the beginning, \(\hat{I}_\mathrm{STRAT}\) is still an unbiased estimator of \(I\). We can validate this by computing the mean of \(\hat{I}_\mathrm{STRAT}\) \[ \begin{aligned} \mathbb{E}[\hat{I}_\mathrm{STRAT}] &= \mathbb{E}\left[\sum_{h=1}^H \frac{|\mathcal{X}_h|}{n_h}\sum_{i=1}^{n_h} f(x_h^i)\right] \\ &= \sum_{h=1}^H \frac{|\mathcal{X}_h|}{n_h}\sum_{i=1}^{n_h} \mathbb{E}[f(x_h^i)] \\ &= \sum_{h=1}^H \frac{|\mathcal{X}_h|}{n_h}\sum_{i=1}^{n_h} |\mathcal{X}_h|^{-1} \int_{\mathcal{X}_h} f(x) dx \\ &= \sum_{h=1}^H \frac{1}{n_h}\sum_{i=1}^{n_h} \int_{\mathcal{X}_h} f(x) dx \\ &= \sum_{h=1}^H \int_{\mathcal{X}_h} f(x) dx \\ &= \int_{\mathcal{X}_1 \cup \dots \cup \mathcal{X}_H} f(x) dx \\ &= \int_\mathcal{X} f(x) dx \\ &= I \end{aligned}. \]

The variance of \(\hat{I}_\mathrm{STRAT}\) is then \[ \begin{aligned} \mathbb{V}[\hat{I}_\mathrm{STRAT}] &=\mathbb{V}\left[\sum_{h=1}^H \frac{|\mathcal{X}_h|}{n_h}\sum_{i=1}^{n_h} f(x_h^i)\right] \\ &=\sum_{h=1}^H \mathbb{V}\left[\frac{|\mathcal{X}_h|}{n_h}\sum_{i=1}^{n_h} f(x_h^i)\right] \\ &=\sum_{h=1}^H \frac{|\mathcal{X}_h|^2}{n_h^2}\mathbb{V}\left[\sum_{i=1}^{n_h} f(x_h^i)\right] \\ &=\sum_{h=1}^H \frac{|\mathcal{X}_h|^2}{n_h^2}\sum_{i=1}^{n_h} \mathbb{V}[f(x_h^i)] \\ &=\sum_{h=1}^H \frac{|\mathcal{X}_h|^2}{n_h^2}\sum_{i=1}^{n_h} \sigma_h^2 \\ &=\sum_{h=1}^H \frac{|\mathcal{X}_h|^2}{n_h}\sigma_h^2 \end{aligned}. \]

We are interested in comparing this estimator against Monte Carlo. In particular, we'd like to show that when \(n_h\) is allocated proportionally to \(|\mathcal{X}_h|\) the estimator cannot increase in variance as compared to crude Monte Carlo.

In order to show this, we need to rewrite the variance of crude Monte Carlo in terms of strata. Let's introduce \(h(x)\) - the stratum that contains \(x\): \(x \in \mathcal{X}_{h(x)}, \forall x \in \mathcal{X}\). By the law of total variance, we have \[ \begin{aligned} \sigma^2 = \mathbb{V}[f(x)] &= \mathbb{E}[\mathbb{V}[f(x) \;\vert\; h(x)]] + \mathbb{V}[\mathbb{E}[f(x) \;\vert\; h(x)]] \\ &= \sum_{h=1}^H |\mathcal{X}_h| \sigma_h^2 + \sum_{h=1}^H |\mathcal{X}_h| (\mu_h - I)^2 \end{aligned}. \] Thus \(\mathbb{V}[\hat{I}] = \frac{\sigma^2}{n} = \frac{1}{n} \left( \sum_{h=1}^H |\mathcal{X}_h| \sigma_h^2\right) + \frac{1}{n} \left(\sum_{h=1}^H |\mathcal{X}_h| (\mu_h - I)^2 \right)\)

Note that in the case of proportional allocation, because \(n_h = n |\mathcal{X}_h|\), \[
\mathbb{V}[\hat{I}_\mathrm{STRAT}] = \frac{1}{n} \sum_{h=1}^H |\mathcal{X}_h| \sigma_h^2
\] which is smaller than \(\mathbb{V}[\hat{I}]\). Although it is not the optimal allocation^{[5]}, proportional allocation as applied to stratified sampling will never increase the variance over that of crude Monte Carlo. Also, note that a poor allocation can also increase the variance of the estimator in comparison to crude Monte Carlo.

It is interesting to note that, if \(f\) is in 2D and is dominated by the horizontal direction more than the vertical direction, partitioning in the horizontal direction would be more effective than partitioning in the vertical direction. We will see it in the example soon.

In this example, we look into 2 ways of stratifying the previously defined area.

Vertically partitioning the space into 10 parts

Horizontally partitioning the space into 10 parts

We illustrate where the samples are located for each method and for crude Monte Carlo in the figure below.

Notice that there are 100 samples in total for each method and that, for the stratified sampling methods, there are exactly 10 samples in each box.

Now what is interesting is the rate of variance of each estimator as the number of samples increase (shown below).

As it can be seen, both QMC methods converge faster than crude Monte Carlo, as expected by our analysis. What's more interesting is that horizontally partitioning the space is more effective than vertically partitioning it. This is because the pre-defined shape is dominated by the y-axis and not the x-axis. This domination can be explained by the shape's symmetry along the y-axis and not the x-axis which means that the y-axis contains more information about the shape.

[1] | The estimator \(\hat{I}\) is an unbiased estimator of \(I\) if \(\mathbb{E}[\hat{I}] = I\). |

[2] | The exact area is \(S = \frac{\pi}{2} + \frac{2}{3}\). |

[3] | The plot of \(y = \frac{1}{x}\) in a log-log plot is a straight line. |

[4] | Stratification doesn't lose generality as any \(I\) of interest can be reformulated to take uniform samples. |

[5] | The optimal allocation is taking \(n_h \propto |\mathcal{X}_h| \sigma_h\) |

© Kai Xu. Last modified: March 23, 2020. Website built with Franklin.jl.