Systematic sampling
General descriptions of systematic sampling
Systematic sampling is the name of a wide class of sampling strategies in which the selection of individual elements follows a systematic pattern. Examples are square grids of sample points laid out over an area of interest, the selection of every 10th tree in an alley, or parallel transects.
Systematic sampling and its applications to forest inventory are best illustrated with square grids of sample points. We may imagine a transparency sheet on which this grid is printed, and this transparency is placed randomly over the map, where randomly means: with a randomly selected starting point and a random orientation. From a sample selection point of view, it is important to state that we have only one independent selection of a sample point; after the first point has been selected, all others are fixed. We defined earlier that sample size is the number of independently selected elements; an immediate conclusion is that a systematic sample is a sample of size \(n = 1\). The “plot” that is being laid out is then a large cluster plot consisting of numerous sub-plots: all the sample points of the systematic sample are, strictly speaking, sub-plots of one single cluster that is spread out over the entire area.
A major question is then whether we can make an unbiased estimation of mean and variance from a random sample of size \(n = 1\). For the estimation of the mean, there is no problem at all: the estimator
\[\bar y=\frac{\sum_{i=1}^n y_i}{n}\,\]
can be calculated and yields the estimation of the mean.
However, when we wish to estimate the variance with the estimator
\[s^2=\frac{\sum_{i=1}^n\left(y_i-\bar y\right)^2}{n-1},\]
we see that this is not possible: with \(n = 1\) the denominator \(n-1\) is zero and the expression is not defined. This is also directly understandable by common sense: one single observation does not contain any information about the variability that is present in the population.
For systematic sampling with only one randomization step, it is important to understand that:
- there is an unbiased estimator for the mean;
- there is no unbiased estimator for the population variance and hence none for the error variance either.
For the estimation of the error variance, which is the most important characteristic for evaluating the statistical performance of a sampling technique, we therefore need to find a solution. First, however, some further issues regarding systematic sampling are addressed.
Systematic sampling is, obviously, a specific sampling technique in its own right. Some authors also refer to it as non-statistical sampling because of the sample size \(n=1\) (and in many cases, no randomization is done at all!) and because of the lack of variance estimators.
When we look at systematic sampling from the point of view of the sampling techniques presented so far, we may express it as a specific case of stratified sampling or as a specific case of cluster sampling. This is illustrated in Figure 1, where a population of \(N\) elements is arranged in groups of \(M\) elements. A systematic sample is, for example, taken by selecting the elements of one column. If we look at one line as one stratum, then systematic sampling means selecting exactly one element per stratum from all strata. Of course, one element per stratum does not allow variance estimation. Alternatively, we may view the selected elements together as exactly one cluster that is taken completely, and that does not allow estimating the error variance either.
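The arrangement described above can be sketched in a few lines of Python. The population values below are hypothetical and chosen only for illustration; the lines (rows) play the role of strata and each column is one possible systematic sample. Enumerating all possible columns shows that the sample mean is unbiased, and it also gives the true error variance, which a single selected column could not estimate.

```python
from statistics import mean

# Hypothetical population: 4 lines (strata) of M = 5 elements each, N = 20.
population = [
    [12, 15, 11, 14, 18],
    [22, 25, 21, 24, 28],
    [32, 35, 31, 34, 38],
    [42, 45, 41, 44, 48],
]

def column_sample(rows, j):
    """Systematic sample: the j-th element of every line (one per stratum)."""
    return [row[j] for row in rows]

def all_systematic_means(rows):
    """Sample mean of every possible systematic sample (one per column)."""
    return [mean(column_sample(rows, j)) for j in range(len(rows[0]))]

true_mean = mean(value for row in population for value in row)
sample_means = all_systematic_means(population)

# Expected value of the estimator: average over all equally likely samples.
expected_value = mean(sample_means)

# True error variance: mean squared deviation of the possible sample means.
# It can only be computed here because the whole population is known.
error_variance = mean((m - true_mean) ** 2 for m in sample_means)

print(true_mean, expected_value, error_variance)
```

In a real inventory only one of the columns is observed, so the spread between columns, which determines the error variance, remains unknown.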
Sample size
Strictly speaking, sample size in systematic sampling is \(n = 1\). However, this does not allow any conclusion about the variances. Therefore, it is common to look at the systematic sample as a sample in which the sub-plots are considered the observation plots.
When a square grid is used, one can calculate the grid size required for a certain number of points to fall into forest. If, for example, we define \(n = 20\) plots to be the size of the sample for a 1000 ha forest, the required square grid size is calculated via the area that each sample point “represents” around it. This area is \(1000\,\mathrm{ha}/20=50\,\mathrm{ha}\). If this is the area of the square that each sample point represents, then the side length of the 50 ha square is the required distance \(d\) between the sample points on the square sample grid, which is here
\[d=\sqrt{\frac{1000\,\mathrm{ha}}{20}}=\sqrt{50\,\mathrm{ha}}=\sqrt{500\,000\,\mathrm{m}^2}\approx 707.1\,\mathrm{m}.\]
However, if this grid is superimposed randomly over the 1000 ha area of interest, it is not guaranteed that exactly the desired number of \(n=20\) sample points always fall into the forest area. The number can be slightly higher or lower; that depends mainly on the shape and fragmentation structure of the forest area, which is illustrated in Figure 2. What we have calculated here is actually not the sample size \(n\) but the expected value of the sample size \(E(n)\): on average we have \(n\) sample points in our area when the random superimposition of the grid over the forest area is repeated very often.
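The calculation and the remark on \(E(n)\) can be illustrated with a small simulation sketch. As a simplifying assumption, the 1000 ha forest is represented here by a hypothetical rectangle of 2000 m by 5000 m; real forest areas are irregular and fragmented, so the spread of the realized sample size will differ. The function names are illustrative only.

```python
import math
import random

def grid_spacing_m(area_ha, n_points):
    """Side length d (in m) of the square that each grid point represents."""
    return math.sqrt(area_ha * 10_000 / n_points)

def count_points_in_rectangle(width_m, height_m, d, rng):
    """Count the points of one randomly placed square grid inside a rectangle."""
    # Random orientation; 0..90 degrees suffices for a square grid.
    theta = rng.uniform(0.0, math.pi / 2)
    # Random starting point within one grid cell.
    u, v = rng.uniform(0.0, d), rng.uniform(0.0, d)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Index range large enough to cover the rectangle for any rotation.
    k = math.ceil(math.hypot(width_m, height_m) / d) + 1
    count = 0
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            gx, gy = i * d + u, j * d + v
            x = cos_t * gx - sin_t * gy
            y = sin_t * gx + cos_t * gy
            if 0.0 <= x < width_m and 0.0 <= y < height_m:
                count += 1
    return count

d = grid_spacing_m(1000, 20)          # about 707.1 m
rng = random.Random(42)               # fixed seed for reproducibility
counts = [count_points_in_rectangle(2000.0, 5000.0, d, rng)
          for _ in range(1000)]
mean_n = sum(counts) / len(counts)

print(round(d, 1), min(counts), max(counts), round(mean_n, 2))
```

The realized number of points varies from one grid placement to the next, while its average over many placements should lie close to the target of 20.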
Some advantages of systematic sampling
Systematic sampling is by far the most frequently applied sampling technique in forest inventory, and there are a number of reasons for that, despite the fact that, unfortunately, no design-unbiased estimator for the error variance is available yet.
Among the advantages are:
- The procedure is easily applied in the field or in any other population of interest and it is easily explained to the field crew or those who are supposed to take the samples.
- It is also easy for those who are interested in the results to understand the sampling procedure. In random sampling, only those who did the random selection themselves know for sure that the selection was truly done at random. All others need to believe it. Whatever arrangement of sample points results, it can theoretically be random, so there are very many possibilities for manipulation. In systematic sampling, however, there are far fewer such possibilities.
- In practically all cases of forest inventory applications, systematic sampling yields more precise results than simple random sampling with the same number of sample points. This can be explained intuitively: regardless of the randomization of the grid, it will always cover the area of interest evenly. Extreme samples that may occur in simple random sampling, for example when all points happen to fall in regions with very low values, are not possible.
However, it can also be explained by the autocorrelation considerations made earlier: in systematic sampling, neighboring sample points always have a minimum distance. It is not possible that neighboring sample points are very close together, where the autocorrelation is expected to be high. That means that the systematic sample collects more, and less correlated, information and is thus more precise.
- In connection with the former point: systematic sampling guarantees that all parts of the population are covered. It cannot happen that a larger region contains no sample point. In fact, when we use a systematic grid of points, the whole population is evenly covered, and if we distinguish different situations in the population (strata, sub-populations), the number of sample points in each such stratum is automatically approximately proportional to the size of the stratum (see Figure 3).
Preoccupations with systematic sampling
The only major preoccupation is the missing unbiased variance estimator. Error variance of systematic sampling, therefore, needs to be approximated. Some approaches are described below.
However, what we know from numerous simulation studies and also from theoretical considerations is that systematic sampling is practically always more precise than simple random sampling with the same number of sample points. In many cases systematic sampling produces an error that is many times (!) smaller than that of simple random sampling.
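The comparison can be made concrete with a small sketch. It assumes a hypothetical one-dimensional population of 200 elements with a spatial trend plus random noise (a stand-in for an autocorrelated forest variable along a transect), not real inventory data. Because the whole population is known here, the true error variance of systematic sampling can be computed by enumerating all possible samples, and it is compared with the textbook error variance of simple random sampling without replacement for the same sample size.

```python
import random
from statistics import mean, variance

rng = random.Random(1)        # fixed seed for reproducibility
N, k = 200, 10                # population size and sampling interval
n = N // k                    # sample size: 20 elements per sample

# Hypothetical population: linear trend plus noise.
population = [0.5 * i + rng.gauss(0.0, 5.0) for i in range(N)]
true_mean = mean(population)

# Systematic sampling: every k-th element, one sample per possible start.
sys_means = [mean(population[start::k]) for start in range(k)]
sys_error_var = mean((m - true_mean) ** 2 for m in sys_means)

# Simple random sampling without replacement, same n:
# Var(mean) = (1 - n/N) * S^2 / n, with S^2 the population variance (N - 1).
srs_error_var = (1 - n / N) * variance(population) / n

print(round(sys_error_var, 2), round(srs_error_var, 2))
```

For a population with such a trend, the systematic error variance is expected to be considerably smaller than the one of simple random sampling; the size of the gain depends on the assumed population structure.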
For normal forest inventory applications there is absolutely no reason not to apply systematic sampling. The only cases in which the senior author of this article saw simple random sampling applied in forest inventories instead of systematic sampling were due to a profound misunderstanding of the error variance issue in systematic sampling: the inventory experts there believed that systematic sampling must not be applied because of the missing variance estimator. While it is true that this estimator is missing, we know that systematic sampling is more precise; we just cannot quantify by how much. We may trust, however, that the unbiased mean that we estimated is on average a better approximation to the true parametric mean than the mean that we would estimate from a simple random sample of the same size.
Sometimes, the worry is expressed that systematic sampling is heavily biased when applied to cyclic populations with a distance between sampling points that corresponds to the periodicity in the population; the usual example is a region with North-South mountain ranges at a distance of 10 km from each other. If the grid size is then, for example, 10 km and the grid orientation is North-South, it is said that this yields a systematic error. However, it must be emphasized that the estimator of the mean is not biased; it is simply the error variance that is relatively high.
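A minimal sketch of this situation, assuming a hypothetical one-dimensional population with an exact period of 10 elements: when the sampling interval equals the period, every systematic sample consists of identical values, so the individual estimates scatter widely, but their average over all possible starting points still equals the true mean. An interval that does not coincide with the period is shown for contrast.

```python
import math
from statistics import mean

N, period = 200, 10
# Hypothetical cyclic population (e.g. values along an East-West line).
population = [10 + 5 * math.sin(2 * math.pi * i / period) for i in range(N)]
true_mean = mean(population)

def systematic_means(values, k):
    """Sample means of all k possible systematic samples with interval k."""
    return [mean(values[start::k]) for start in range(k)]

def error_variance(sample_means, target):
    """Mean squared deviation of the possible sample means from the target."""
    return mean((m - target) ** 2 for m in sample_means)

means_10 = systematic_means(population, 10)   # interval equals the period
means_8 = systematic_means(population, 8)     # interval differs from period

expected_10 = mean(means_10)                  # still the true mean: unbiased
err_var_10 = error_variance(means_10, true_mean)
err_var_8 = error_variance(means_8, true_mean)

print(round(expected_10, 6), round(err_var_10, 6), round(err_var_8, 6))
```

With the interval equal to the period, the error variance is as large as the variance of the single values, while the estimator remains unbiased over the randomization of the starting point.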
Implementation of systematic sample selection
What we have described so far is the case of systematic sampling in which the sample grid is randomly placed over the region of interest. This is the situation in which we may look at systematic sampling as a random sample of \(n = 1\) clusters (where the selected cluster is quite large). However, looking at the practical applications of systematic sampling, not all forest inventories follow the randomization approach to selecting a systematic grid.
In most cases, neither the starting point nor the grid orientation is selected randomly; usually round grid coordinates (for example 1000, 1000) are defined as the starting point, and the grid orientation is usually (and almost traditionally) defined as North-South. By doing so, we no longer deal with a statistical sampling procedure, and here the term “non-statistical sampling” is certainly justified. There are not many studies that have investigated the effect of this lack of randomization. In general, it can be assumed that there are no major problems, provided that starting point and orientation are not chosen on the basis of terrain or other subject-matter criteria that might affect the observations.
One of the few studies on that topic is Kleinn (1991[2]): there, a sampling simulation study was carried out with grids of different width, with random and with fixed orientation. Two results are presented in Figure 4. On the left side, squares of different side lengths are sampled by grids of unit width; this is equivalent to sampling a square with grids of different sizes. The right-hand graph in Figure 4 gives the results of the same comparison of random and fixed grids of different width on a forest map (the same that is used for area estimation). From the cases in Figure 4 we may draw a preliminary conclusion: there is a difference in the precision of estimation between random and fixed orientation. For the fixed orientation, the error is mostly higher; this was observed for practically all forest maps studied in Kleinn (1991[2]). Only in the simple case of estimating the area of a square does the fixed grid orientation yield more precise estimates for some specific grid sizes (if the side of the square is a multiple of the grid size). Also, for fixed grid orientation the curves (standard error over grid size) are much more irregular than for random orientation.
In some cases, systematic sampling with multiple random starts is applied. That means essentially that not only one large cluster is selected but a number of \(n>1\) clusters. The error variance can then be estimated using the known cluster sampling estimators.
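A minimal sketch of this idea is given below. It assumes a hypothetical one-dimensional population in which the \(k\) possible systematic samples are equally sized clusters, and that the random starts are drawn by simple random sampling without replacement; under these assumptions the usual estimator for a simple random sample of equal-sized clusters applies. The function name and the population are illustrative and not taken from the source.

```python
import random
from itertools import combinations
from statistics import mean, variance

def multi_start_estimate(values, k, starts):
    """Mean and estimated error variance from several systematic samples.

    Each start defines one cluster (every k-th element); the clusters are
    treated as a simple random sample, without replacement, of the k
    possible clusters of equal size.
    """
    cluster_means = [mean(values[start::k]) for start in starts]
    m = len(cluster_means)
    estimate = mean(cluster_means)
    var_estimate = (1 - m / k) * variance(cluster_means) / m
    return estimate, var_estimate

rng = random.Random(7)            # fixed seed for reproducibility
N, k, m = 200, 10, 3              # population size, interval, number of starts
population = [0.5 * i + rng.gauss(0.0, 5.0) for i in range(N)]
pop_mean = mean(population)

# One realization: m random starts out of the k possible ones.
starts = rng.sample(range(k), m)
estimate, var_estimate = multi_start_estimate(population, k, starts)

# Check over all possible sets of starts: the variance estimates average
# to the true error variance of the estimator.
all_results = [multi_start_estimate(population, k, combo)
               for combo in combinations(range(k), m)]
mean_of_var_estimates = mean(v for _, v in all_results)
true_error_var = mean((e - pop_mean) ** 2 for e, _ in all_results)

print(round(estimate, 2), round(var_estimate, 2))
print(round(mean_of_var_estimates, 4), round(true_error_var, 4))
```

The price of estimable variance is a coarser grid per start for the same total number of sample points, so the even coverage of a single dense grid is partly given up.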
References
- [1] Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing, Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 pp.
- [2] Kleinn, C. 1991. Der Fehler von Flächenschätzungen mit Punkterastern und linienförmigen Stichprobenelementen. Dissertation. Mitteilungen der Abteilung für Forstliche Biometrie 91-1, Universität Freiburg. 128 pp.