Variance issue in systematic sampling
Figure 1. Building pairs of neighboring observations for the approximation of error variance in systematic sampling (Kleinn 2007[1]). Pairs can be built either “exclusively” (below) or overlapping (above).
Figure 2. Different patterns of systematic sample grids: A is a square grid, B a rectangular grid with \(a:b=2:1\), C a rectangular grid with \(a:b=8:1\), and D a triangular grid as defined in Matérn (1960[2]).
Latest revision as of 08:29, 8 May 2017
Empirical approximation of error variance
There is no design-unbiased variance estimator in systematic sampling. If we are interested in the true error variance, the only way to obtain it is to repeat the systematic sample many times and to calculate the variance of all the estimates produced. This is an empirical approximation to the parametric error variance, and it comes closer to the unknown true value the larger the number of repetitions.
Of course, this is not a viable approach for practical implementation, but it is something that can be done in computer simulations.
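Such a simulation can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the population is a hypothetical one-dimensional list (a linear trend plus noise), the sample takes every `k`-th element, and the function names are invented for this sketch. Because a one-dimensional population of this kind has only `k` distinct systematic samples, the sketch enumerates all possible starts instead of drawing starts repeatedly.

```python
import random
import statistics


def systematic_sample(population, k, start):
    """Return every k-th element of the population, beginning at index `start`."""
    return population[start::k]


def empirical_error_variance(population, k):
    """Variance of the sample mean over all k possible systematic samples."""
    means = [statistics.fmean(systematic_sample(population, k, s)) for s in range(k)]
    return statistics.pvariance(means)


# Hypothetical population: a linear trend plus random noise (illustrative only).
rng = random.Random(42)
population = [0.1 * i + rng.gauss(0.0, 5.0) for i in range(1000)]

k = 20  # sampling interval, so each sample has 1000 / 20 = 50 elements
true_var = empirical_error_variance(population, k)
print(f"error variance over all {k} possible starts: {true_var:.4f}")
```

In a real inventory only one of these samples is ever observed, which is why this quantity has to be approximated from a single sample.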
Using SRS estimators
What is most frequently done for variance estimation in systematic sampling is to apply the estimators of the simple random sampling framework. It is well known that these estimators are not unbiased for systematic sampling; they consistently yield overestimates of the true error variance, and this positive bias can be considerable. We call this sort of estimation a “conservative estimation”: we know that the true error is less (in many cases much less) than the calculated estimate. An example is presented further down in the article Area estimation by points, where area estimation by dot grids is presented.
Remember: for error variance estimation we usually apply the simple random sampling estimator
\[s_{\bar y}^2=\frac{s^2}{n}\]
(essentially because we do not know any better). This, however, is not an unbiased estimator but produces an overestimation of the true error variance.
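As a minimal sketch, the simple random sampling estimator can be computed as follows. The function name and the sample values are hypothetical, and no finite population correction is applied because the formula above does not contain one.

```python
import statistics


def srs_variance_of_mean(sample):
    """Simple random sampling estimator s^2 / n for the error variance of the mean."""
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance with divisor n - 1
    return s2 / n


# Hypothetical plot values observed on a systematic grid (illustrative only).
sample = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
print(f"s^2 / n = {srs_variance_of_mean(sample):.4f}")
```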
Random differences method
Numerous approximations have been developed that come closer to the true error variance than the simple random sampling estimator does. Two of the simpler ones are presented here, starting with the so-called “random differences method”.
Assume that the elements in the population and also the \(n\) elements in the systematic sample have the same expected value. We may assume this because we have an unbiased estimator for the mean. If we (repeatedly) select random pairs out of the \(n\) elements of the systematic sample and calculate the difference for each pair, we expect this difference to be zero on average:
Let \(d=Y_1-Y_2\); then \(E(d)=E(Y_1-Y_2)=E(Y_1)-E(Y_2)=\mu-\mu=0\).
The variance of the difference, \(var(d)=var(Y_1-Y_2)\), is then determined by the rules for linear combinations of random variables, as known from developing the estimators for stratified random sampling. As we select each of the two elements of a pair independently at random, the covariance term below becomes zero:
\[\begin{aligned}
var(d) &= var(Y_1-Y_2)\\
&= var(Y_1)+var(Y_2)-2\,cov(Y_1,Y_2)\\
&= var(Y_1)+var(Y_2)\\
&= 2\sigma^2
\end{aligned}\]
where \(\sigma^2\) is the population variance, which is the same for \(Y_1\) and \(Y_2\). If \(n_d\) pairs are formed, \(2\sigma^2\) is estimated from the squared differences; because the expected difference is zero, deviations are taken from \(E(d)=0\):
\[2\hat{\sigma}^2=\frac{\sum_{i=1}^{n_d}\left(d_i-E(d)\right)^2}{n_d}=\frac{\sum_{i=1}^{n_d}d_i^2}{n_d}\,\]
and the estimated error variance of the mean for systematic sampling with the random differences method is
\[\hat{var}_{rd}\left(\bar y_{syst}\right)=\frac{\hat{\sigma}^2}{n}=\frac{1}{n}\frac{1}{2n_d}\sum_{i=1}^{n_d}d_i^2\,\]
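The random differences estimator can be sketched as below. This is an illustration, not a reference implementation: the function name is invented, and the two elements of each pair are drawn independently with replacement, which is one reading of "independently at random" (it means an element can occasionally be paired with itself).

```python
import random


def random_differences_variance(sample, n_d, rng):
    """Estimate the error variance of the mean from n_d randomly formed pairs."""
    n = len(sample)
    sum_sq = 0.0
    for _ in range(n_d):
        # Draw both elements of the pair independently (assumption: with replacement).
        y1 = rng.choice(sample)
        y2 = rng.choice(sample)
        sum_sq += (y1 - y2) ** 2
    sigma2_hat = sum_sq / (2 * n_d)  # estimate of the population variance
    return sigma2_hat / n            # error variance of the mean


# Hypothetical sample values (illustrative only).
sample = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
estimate = random_differences_variance(sample, n_d=1000, rng=random.Random(1))
print(f"random differences estimate: {estimate:.4f}")
```

The result depends on the random pairs drawn, so repeated calls with different seeds give slightly different values.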
Pair difference technique
Another approach was developed by Lindeberg (1924), who analyzed the systematic field data of the early Nordic national forest inventories. He imagined neighboring observations to form a stratum, so that the whole sample of \(n\) elements consists of \(n/2\) strata with a sample size of \(n_h = 2\) in each stratum (see Figure 1). He then applied the formula for stratified random sampling and arrived at the formula below. Of course, this is again only an approximation: the estimators of stratified random sampling do not strictly apply, because sampling within the strata was not random.
However, many simulation studies have shown that this approximation is often fairly close to the true error variance, sometimes overestimating and sometimes underestimating it, depending on the population structure and the sample taken.
An example for area estimation with dot grids is presented in the chapter "Comparison of different grid shapes in systematic sampling", which can be found below.
In a stratum with \(n_h=2\) randomly sampled elements, the population variance within that stratum \(h\) is estimated from
\[s_h^2=\frac{\sum_{i=1}^{n_h}\left(y_{hi}-\bar{y}_h\right)^2}{n_h-1}=\frac{1}{2}\left(y_{h1}-y_{h2}\right)^2\,\]
where the variance formula reduces to half the squared difference of the two observations.
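The reduction can be checked in two steps, using only the symbols already defined. With two observations, each one deviates from the stratum mean by half their difference:
\[\bar{y}_h=\frac{y_{h1}+y_{h2}}{2},\qquad y_{h1}-\bar{y}_h=\frac{y_{h1}-y_{h2}}{2},\qquad y_{h2}-\bar{y}_h=-\frac{y_{h1}-y_{h2}}{2}\]
so that, with \(n_h-1=1\),
\[s_h^2=\frac{1}{2-1}\left[\frac{\left(y_{h1}-y_{h2}\right)^2}{4}+\frac{\left(y_{h1}-y_{h2}\right)^2}{4}\right]=\frac{1}{2}\left(y_{h1}-y_{h2}\right)^2\]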
Assume that we form \(L=n/2\) strata of the same size, so that the stratum weights are constant, \(w_h=1/L\). The error variance of the estimated mean over all strata then results, as usual in stratified random sampling, from
\[\hat{var}_{pd}\left(\bar{y}_{syst}\right)=\sum_{h=1}^L w_h^2\frac{s_h^2}{n_h}=\sum_{h=1}^L\frac{\left(y_{h1}-y_{h2}\right)^2}{4L^2}\,\]
This estimator actually corresponds to the error variance estimator of the random differences method if we select \(n_d=n/2=L\) pairs of observations:
\[\begin{aligned}
\hat{var}_{rd}\left(\bar{y}_{syst}\right) &= \frac{1}{n}\frac{1}{2n_d}\sum_{i=1}^{n_d}d_i^2\\
&= \frac{1}{2n_d}\frac{1}{2n_d}\sum_{i=1}^{n_d}d_i^2\\
&= \frac{1}{4L^2}\sum_{i=1}^{n_d}d_i^2
\end{aligned}\]
The pair differences technique may also be applied for overlapping pairs as depicted in Figure 1.
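A minimal sketch of the pair difference estimator for non-overlapping pairs follows. It assumes that the sample is given as a list in spatial order, so that consecutive elements are neighbors, and that a leftover element of an odd-sized sample is simply dropped; both are assumptions of this sketch, as is the function name. The overlapping variant is not covered here.

```python
def pair_difference_variance(sample):
    """Pair difference estimate of the error variance of the mean.

    Consecutive elements (0,1), (2,3), ... form the L = n // 2 strata.
    """
    pairs = list(zip(sample[0::2], sample[1::2]))  # non-overlapping neighbor pairs
    L = len(pairs)
    if L == 0:
        raise ValueError("at least two observations are needed")
    return sum((y1 - y2) ** 2 for y1, y2 in pairs) / (4 * L ** 2)


# Hypothetical sample in spatial order (illustrative only).
sample = [1.0, 3.0, 2.0, 6.0]
print(pair_difference_variance(sample))
```

For this small sample the two squared differences are 4 and 16, and with \(L=2\) the estimate is \(20/16\).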
Consequences of variance approximation in systematic sampling
When estimation is the only issue, systematic sampling is always to be preferred for forest inventory.
However, if statistical inference is to be made that involves testing or comparing estimates, one should seriously consider whether the merely approximated error variances invalidate the tests and comparisons. All (parametric) statistical testing requires unbiased estimation of variances.
All other calculations that include variances (such as calculating confidence intervals or the required sample size for a given precision) are also affected by the lack of a design-unbiased error variance estimator. This must always be taken into account when, for example, the conservative estimate from the simple random sampling estimator is used: the required sample size for predefined precision levels is overestimated; the width of the confidence interval is likewise overestimated; and when two systematic samples are compared (for example with the \(t\)-test), the probability \(\alpha\) of committing a Type I error will be smaller than in tests where an unbiased estimate of the error variance is available. This implies that the test is conservative and has less power.
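The size of these effects can be illustrated with a small sketch. It assumes a normal approximation with the quantile 1.96 and the usual simple random sampling formula \(n=z^2s^2/E^2\) for the required sample size; the function names, the variance values, and the bias factor of 2 are hypothetical. Under these assumptions an overestimated variance inflates the required sample size proportionally, while the confidence interval widens only with the square root of the factor.

```python
import math


def ci_half_width(var_of_mean, z=1.96):
    """Half-width of an approximate 95% confidence interval for the mean."""
    return z * math.sqrt(var_of_mean)


def required_sample_size(element_variance, allowed_error, z=1.96):
    """Sample size needed so that the CI half-width does not exceed allowed_error."""
    return math.ceil(z ** 2 * element_variance / allowed_error ** 2)


# Hypothetical numbers: the conservative estimate is twice the true variance.
true_var, bias_factor = 4.0, 2.0
print(ci_half_width(true_var), ci_half_width(bias_factor * true_var))
print(required_sample_size(100.0, 2.0), required_sample_size(bias_factor * 100.0, 2.0))
```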
In conclusion, if a statistically unbiased estimate of the error variance is required in further analyses, systematic sampling with a single random start (the \(n=1\) type) should be avoided. In that case, systematic sampling with multiple random starts, for example, might be considered.
Comparison of different grid shapes in systematic sampling
Systematic sampling can be done with many different spatial arrangements of sample points. So far we have mainly discussed square grids. However, other shapes such as rectangular and triangular grids are also used. Matérn (1960[2]) made an instructive comparison of the statistical performance of these grid types for area estimation with dot grids.
He compared grids with the same point density (that is, the same average number of points per unit area) and came to the following result (also depicted in Figure 2): the triangular grid is the most precise on average because it has, given the same point density, the largest average inter-point distance. The square grid is slightly less precise, but the difference is just 1–4% according to the results of Matérn (1960). With a rectangular grid of side lengths \(a:b=1:2\), precision was 20–70% lower, and for the transect-like dot grid with \(a:b=1:8\) he no longer found any difference from simple random sampling.
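The geometric part of this comparison can be reproduced with a short sketch that computes the grid spacings for a given point density. It assumes that the triangular grid consists of equilateral triangles (each point then represents an area of \(\frac{\sqrt{3}}{2}d^2\)) and it compares nearest-neighbor distances only; the function names are invented for this illustration.

```python
import math


def square_spacing(density):
    """Side length of a square grid with `density` points per unit area."""
    return 1.0 / math.sqrt(density)


def rectangular_spacing(density, ratio):
    """Side lengths (a, b) with b = ratio * a and a * b = 1 / density."""
    a = math.sqrt(1.0 / (density * ratio))
    return a, ratio * a


def triangular_spacing(density):
    """Point distance in an equilateral triangular grid (assumed grid type)."""
    return math.sqrt(2.0 / (math.sqrt(3.0) * density))


density = 1.0  # one point per unit area
print("square     :", square_spacing(density))
print("rect 1:2   :", rectangular_spacing(density, 2.0))
print("rect 1:8   :", rectangular_spacing(density, 8.0))
print("triangular :", triangular_spacing(density))
```

At equal density the nearest neighbor is farthest away in the triangular grid and moves closer as the rectangle becomes more elongated.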
While the triangular grid is the most precise, for practical applications it appears justified to regard the square grid as the optimal shape, because in many cases a square grid is much easier to implement in the field than a triangular one.
In some cases, a rectangular grid of \(1:2\) is used because walking time between the grid points is much shorter on average. However, this goes along with a loss in precision.
References
1. Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing, Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 pp.
2. Matérn, B. 1960. Spatial variation – Stochastic models and their application to some problems in forest surveys and other sampling investigations. Medd. Statens Skogsforskningsinstitut 49:5.