Stratified sampling
Revision as of 13:39, 13 January 2009
Stratified sampling is not a sampling design of its own, but a procedure for subdividing a population into separate and more homogeneous sub-populations called strata (Kleinn 2007[1]). In some situations it is useful from a statistical point of view, or required for practical and organizational reasons, to subdivide the population into different strata. The major characteristic is that an independent sampling study is carried out in each stratum, and each stratum is treated as a sub-population whose parameters need to be estimated. If random sampling is applied within the strata, we call this stratified random sampling. The only difference between simple random sampling and stratified random sampling is that the latter consists of several sampling studies; the only additional question is how to combine the estimates from the individual strata into estimates for the total population.
Stratified sampling is especially efficient where the variability within the strata is low and the differences between the stratum means are large (Akca 2001[2]). In this case we can achieve a higher precision with the same sample size.
We can distinguish two general approaches to stratification: pre-stratification, in which the strata are formed before the sampling study starts, and post-stratification, in which the strata are generated in the course of the sampling or even afterwards, based on the data. In the first case, which is described in this article, the strata must be defined and, in the case of geographical strata, delineated in order to define the sampling frame.
Besides statistical issues there are further arguments for stratification. The precondition for a meaningful partitioning of a population into non-overlapping strata is the availability of prior information that can be used as stratification criteria (de Vries 1986[3]). In forest inventories this information might be available in the form of forest management or GIS data, or can be derived from remote sensing data such as aerial photographs. From a statistical point of view, the most efficient stratification is one based on the target variable of the inventory itself. As this target variable is typically not known before the inventory, forest variables that are correlated with it are used as stratification criteria instead. In large managed forest areas, for example, age classes or forest types might be good stratification criteria if the aim is to estimate volume per hectare.
Arguments for stratification
Sometimes it is useful to subdivide the population of interest into a number of sub-populations and to carry out independent sampling in each of these strata. There are statistical as well as practical considerations that make this technique attractive for large-area forest inventories. It is not without reason that almost all national forest inventories are based on stratification.
- Statistical justifications
- The spatial distribution of sample points within the population is more even if the points are selected separately in the individual strata,
- It is possible to make an individual optimization of sampling and plot design for each stratum,
- One usually increases the precision of the estimations for the total population,
- Separate estimations for each of the strata are produced in a pre-planned manner,
- It is guaranteed that there are actually sufficient observations in each one of the strata,
- It is possible to produce estimations with defined precision level for sub-populations.
- Practical justification
- The possibility to optimize the inventory design separately for each stratum is very efficient and helps to minimize costs,
- It facilitates inventory work (particularly field work): independent field campaigns can be carried out in each stratum,
- It allows a better specialization of field crews (e.g. botanists).
Stratification criteria
A population can be partitioned according to very different stratification criteria. If the reason for stratification is not to improve the precision of the estimates, the stratification variables need not be correlated with the target variable. Under special conditions it might be useful to stratify even if there is no statistical justification. For example, a political boundary dividing a forest area might be a good reason for stratification, even if the forest is very homogeneous, if separate estimates for both parts are required afterwards. Other examples of meaningful criteria are:
- Geographical stratification
- Ecozones,
- Forest types,
- Site and soil types,
- Topographical conditions,
- Political boundaries or properties,
- ...
Furthermore, the expected inventory costs (in terms of time consumption) can be used as a criterion. These costs are typically correlated with the geographical conditions mentioned above. It might, for example, be reasonable to stratify a forest area by slope classes if the time needed for field work differs significantly between flat terrain and steep slopes.
- Subject matter stratification
- Species,
- Species groups (e.g. commercial / non-commercial),
- Tree sociological classes,
- Age classes in plantation forests,
- ...
Statistics
The only new concept that needs to be introduced in stratified random sampling is how to combine the estimates derived for the different strata. The estimators for stratified random sampling are based on simple considerations about linear combinations (Kleinn 2007[1]). If we have two independent random variables \(Y_1\,\) and \(Y_2\,\) and are interested in their sum \(Y_1+Y_2\,\), then
\[E(Y_1+Y_2)=E(Y_1)+E(Y_2)\,\] and
\[var(Y_1+Y_2)=var(Y_1)+var(Y_2)\,\]
- Simple:
- The expected value of the sum is equal to the sum of the individual expected values. It is intuitively clear that we can add up the stratum totals to derive an overall total. It is different if the variables are means: in that case we have to weight the individual means according to the sizes of the strata.
If we consider \(Y_1\,\) and \(Y_2\,\) as estimations from the two strata 1 and 2, then we can apply the principles derived from these considerations for stratified sampling.
However, to calculate the overall mean from two estimations, we have to weight the single means in order to account for possibly different sizes of the sub-populations \(N_1\,\) and \(N_2\,\). If both strata are of the same size, we can calculate the mean by:
\[\frac 12 (Y_1+Y_2)=\frac 12 Y_1+\frac 12Y_2=c_1Y_1+c_2Y_2\,\].
The factor \(c_i\,\) can be interpreted as a weight for the individual estimates from strata 1 and 2. Because both strata are of equal size in this example, \(c_1=c_2\,\) holds. A more typical case is that we deal with strata of unequal sizes.
- Example:
- Weighting of partial results (or estimates) is important if they stem from sub-populations of different sizes and we want to calculate a mean. A simple example: you are asked to calculate the mean body weight of 50 students in a classroom. You have a mean value for the 15 women (55 kg) and a mean body weight for the 35 men (73 kg). The unweighted mean (64 kg) would be wrong. The correct value is 15/50*55+35/50*73=67.6 kg. The weights 15/50 and 35/50 express the share of the respective group in the total population. This weight is equal to the selection probability under simple random sampling.
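The classroom example can be checked with a few lines of Python. This is only an illustrative sketch; the function name `weighted_mean` is chosen here and does not come from the source.

```python
def weighted_mean(means, sizes):
    """Combine group means using weights proportional to group size."""
    total = sum(sizes)
    return sum(size / total * mean for mean, size in zip(means, sizes))

group_means = [55.0, 73.0]   # kg: women, men (values from the example above)
group_sizes = [15, 35]

# The unweighted mean ignores the different group sizes.
print(sum(group_means) / len(group_means))                  # 64.0
# The weighted mean uses the shares 15/50 and 35/50.
print(round(weighted_mean(group_means, group_sizes), 1))    # 67.6
```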
The weights must be proportional to the size of the sub-populations. In case of forest inventories the population is typically an infinite number of dimensionless points (we are selecting sample points in a forest area) and we can describe the size of a sub-population by the area of all these points. The sum of all weights must be 1, so that:
\[\sum c_i=1\,\]
The expected value E for strata of different sizes is then:
\[E(c_1Y_1+c_2Y_2)=E(c_1Y_1)+E(c_2Y_2)=c_1E(Y_1)+c_2E(Y_2)\,\] , where \(c_1 \neq c_2\,\), or
\[E(\sum c_iY_i)=\sum c_iE(Y_i)\,\]
and the variance:
\[var(c_1Y_1+c_2Y_2)=var(c_1Y_1)+var(c_2Y_2)=c_1^2var(Y_1)+c_2^2var(Y_2)\,\] , or
\[var(\sum c_iY_i)=\sum c_i^2var(Y_i)\,\].
- Note:
- If a random variable is multiplied by a constant factor, its variance is multiplied by the square of that factor, because the variance is a squared measure!
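Both rules can be verified exactly for a small made-up case. The sketch below enumerates the joint distribution of two independent discrete variables with exact fractions; the distributions and weights are purely hypothetical and serve only to illustrate that the weights enter the expectation linearly and the variance squared.

```python
from fractions import Fraction
from itertools import product

def mean_var(dist):
    """Mean and variance of a discrete distribution {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, var

# Two hypothetical independent variables (values and probabilities are made up).
half = Fraction(1, 2)
y1 = {Fraction(0): half, Fraction(4): half}
y2 = {Fraction(1): Fraction(1, 4), Fraction(5): Fraction(3, 4)}
c1, c2 = Fraction(3, 10), Fraction(7, 10)   # weights that sum to 1

# Exact distribution of c1*Y1 + c2*Y2 under independence.
combo = {}
for (v1, p1), (v2, p2) in product(y1.items(), y2.items()):
    value = c1 * v1 + c2 * v2
    combo[value] = combo.get(value, 0) + p1 * p2

m1, s1 = mean_var(y1)
m2, s2 = mean_var(y2)
m, s = mean_var(combo)
print(m == c1 * m1 + c2 * m2)            # True: the expectation is linear
print(s == c1**2 * s1 + c2**2 * s2)      # True: weights are squared in the variance
```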
Notation
Notation | Meaning
---|---
\(L\,\) | Number of strata, \(h=1, \dots , L \,\)
\(N\,\) | Total population size
\(N_h\,\) | Size of stratum \(h\,\) (\(N=\sum_{h=1}^L N_h\,\))
\(\bar y\,\) | Estimated population mean
\(\bar y_h\,\) | Estimated mean of stratum \(h\,\)
\(n\,\) | Total sample size
\(n_h\,\) | Sample size in stratum \(h\,\)
\(S^2_h\,\) | Sample variance in stratum \(h\,\)
\(\tau\,\) | Total
\(\tau_h\,\) | Total in stratum \(h\,\)
\(\hat \tau_h\,\) | Estimated total in stratum \(h\,\)
\(c_h\,\) | Relative share (weight) of stratum \(h\,\)
\(\hat {var} (\bar y)\,\) | Estimated error variance of the mean
\(\hat {var} (\hat \tau)\,\) | Estimated error variance of the total
Estimator for the mean
The estimator of the mean for stratified random sampling follows from the considerations above and is analogous to the estimator for simple random sampling:
\[\bar y = \sum_{h=1}^L \frac{N_h}{N} \bar y_h = \frac {1}{N} \sum_{h=1}^L N_h \bar y_h\,\]
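As a minimal sketch, the estimator can be written as a short Python function. The stratum sizes and stratum means below are hypothetical numbers (for instance areas in ha and mean volumes in m³/ha) chosen only for illustration.

```python
def stratified_mean(N_h, ybar_h):
    """Estimated population mean: (1/N) * sum of N_h * ybar_h over all strata."""
    N = sum(N_h)
    return sum(Nh * yh for Nh, yh in zip(N_h, ybar_h)) / N

# Hypothetical example with three strata.
N_h = [200, 300, 500]            # stratum sizes
ybar_h = [120.0, 250.0, 310.0]   # estimated stratum means

print(stratified_mean(N_h, ybar_h))   # 254.0
```

The largest stratum dominates the result: the unweighted average of the three stratum means would be about 226.7 instead.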
Estimator for the variance
The estimator for the variance (selection without replacement) is:
\[\hat {var} (\bar y) = \sum_{h=1}^L \left\lbrace \left( \frac {N_h}{N} \right)^2 \hat {var} (\bar y_h) \right\rbrace = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {N_h-n_h}{N_h} \frac {S^2_h}{n_h}\].
Here \((N_h-n_h)/N_h\,\) is the finite population correction, which is necessary if the strata are small and/or the sample size is large. It is typically applied if the ratio of sample size to population size (the sampling fraction) is larger than 0.05 (Akca 2001[2]).
- Note:
- A finite population correction is important if we select without replacement and the population size is noticeably reduced by drawing the samples. As a consequence, the selection probabilities change with every sample we draw (because the remaining population decreases), which is corrected by this factor.
This estimator looks complex, but it can be read as a weighted linear combination of the simple random sampling estimators applied to the strata. The weighting factor in this case is
\[c_h^2=\frac {N_h^2}{N^2}\,\]
Without finite population correction the variance is:
\[\hat {var} (\bar y) = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {S^2_h}{n_h}\].
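The two variance estimators differ only in the finite population correction, so a single function with a switch can cover both. The following is a sketch with hypothetical input values; `N_h` is taken here as the number of possible sample units per stratum.

```python
def stratified_mean_variance(N_h, n_h, s2_h, fpc=True):
    """Estimated error variance of the stratified mean.

    N_h  : stratum sizes
    n_h  : sample sizes per stratum
    s2_h : sample variances per stratum
    fpc  : apply the finite population correction (N_h - n_h) / N_h
    """
    N = sum(N_h)
    total = 0.0
    for Nh, nh, s2 in zip(N_h, n_h, s2_h):
        correction = (Nh - nh) / Nh if fpc else 1.0
        total += Nh**2 * correction * s2 / nh
    return total / N**2

# Hypothetical example: three strata, 5% of the units sampled in each.
N_h = [200, 300, 500]
n_h = [10, 15, 25]
s2_h = [400.0, 900.0, 1600.0]

print(stratified_mean_variance(N_h, n_h, s2_h, fpc=False))          # 23.0
print(round(stratified_mean_variance(N_h, n_h, s2_h, fpc=True), 2))  # 21.85
```

With a sampling fraction of 0.05 in every stratum the correction reduces the variance by exactly 5%.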
Estimator of the total
\[\hat\tau = N\bar y = \sum_{h=1}^L \hat \tau_h = \sum_{h=1}^L N_h \bar y_h\,\]
The estimated error variance of the estimated population total then follows as:
\[\hat{var}(\hat {\tau}) = \hat{var}(N \bar y) = N^2 \hat{var}(\bar y)\]
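A short sketch of the total estimator and its error variance, again with hypothetical numbers. The variance of the mean is passed in as a given value so that the block stays self-contained.

```python
from math import sqrt

def stratified_total(N_h, ybar_h):
    """Estimated population total: sum of the stratum totals N_h * ybar_h."""
    return sum(Nh * yh for Nh, yh in zip(N_h, ybar_h))

def total_variance(N_h, var_mean):
    """Error variance of the total: N^2 times the error variance of the mean."""
    N = sum(N_h)
    return N**2 * var_mean

# Hypothetical example values.
N_h = [200, 300, 500]
ybar_h = [120.0, 250.0, 310.0]
var_mean = 23.0                      # assumed error variance of the stratified mean

tau_hat = stratified_total(N_h, ybar_h)
var_tau = total_variance(N_h, var_mean)
print(tau_hat)                       # 254000.0
print(var_tau)                       # 23000000.0
print(sqrt(var_tau))                 # standard error of the total
```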
To calculate the confidence interval for the estimate we need the standard error and the number of degrees of freedom. In stratified random sampling the number of degrees of freedom poses a difficulty. While it is a direct function of the sample size (\(DF=n-1\,\)) in simple random sampling, here we deal with \(L\,\) different sample sizes that are combined. If the variances differ significantly between the strata, it is not possible to simply pool the degrees of freedom from the different strata. This statistical problem is known as the Behrens-Fisher problem. Welch and Satterthwaite proposed an approximation formula for the "effective number of degrees of freedom", so that the t-statistic can be applied approximately correctly:
\[DF_e=\frac {\left(\sum_{h=1}^L g_h S_h^2 \right)^2}{\sum_{h=1}^L \frac {g_h^2 S_h^4}{n_h-1}}\,\]
where \(g_h = N_h(N_h-n_h)/n_h\,\).
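The approximation can be sketched in Python as follows. The sketch takes \(g_h = N_h(N_h-n_h)/n_h\), the usual choice for this approximation in stratified sampling; this definition and the example values are assumptions made for illustration.

```python
def effective_df(N_h, n_h, s2_h):
    """Effective degrees of freedom (Satterthwaite-type approximation).

    Assumes g_h = N_h * (N_h - n_h) / n_h for each stratum.
    """
    g_h = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]
    numerator = sum(g * s2 for g, s2 in zip(g_h, s2_h)) ** 2
    denominator = sum(g**2 * s2**2 / (nh - 1)
                      for g, s2, nh in zip(g_h, s2_h, n_h))
    return numerator / denominator

# Hypothetical example with clearly different stratum variances.
N_h = [200, 300, 500]
n_h = [10, 15, 25]
s2_h = [400.0, 900.0, 1600.0]

df_e = effective_df(N_h, n_h, s2_h)
print(df_e)
```

The result always lies between the smallest \(n_h-1\) and the pooled value \(\sum (n_h-1)\); the pooled value is reached only when all terms \(g_h S_h^2/(n_h-1)\) are equal.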
Sample size
The necessary sample size always depends on the error probability, the allowed error and the variability within the population. In stratified sampling we additionally have to consider that the variances differ between the strata. These variances must be weighted to calculate the necessary sample size.
- Note:
- The "necessary" sample size is the estimated number of samples we need to obtain an estimate with a defined confidence interval. This interval is determined by the predetermined error probability \(\alpha\,\), the corresponding t-value from the t-distribution and the allowed error we define for the inventory. Typically this error is 10% for standard forest inventories.
Compare the following formula with the formula provided for simple random sampling:
\[n = \frac {t^2 \sum \frac {N^2_h S^2_h}{w_h}}{N^2 A^2}\,\],
where \(A\,\) is the allowed error and \(w_h = n_h/n\,\) is the share of the samples that falls in stratum \(h\,\).
- Note:
- To calculate the overall sample size we obviously need to know the share of samples in the different strata. That may sound illogical, because the sample size is what we want to calculate here. However, since the sample size in each stratum influences the expected error, it is clear that we need this information. That is why we have to define the allocation scheme before we start.
In all sample size calculations one has to know beforehand how the total number of samples will be assigned to the different strata: the set of \(w_h\,\) must be predetermined!
In small populations (or with large samples) we have to consider the finite population correction, and the above formula becomes:
\[n=\frac {\sum_{h=1}^L \frac {N_h^2 S_h^2}{w_h}}{\frac {N^2 A^2}{t^2}+ \sum_{h=1}^L N_hS_h^2}\,\]
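Both sample size formulas can be sketched in one function. The allowed error `A` is taken here in the same units as the mean (an absolute error), `t` defaults to 2, and the stratum values and shares `w_h` are hypothetical; these are assumptions of the example, not prescriptions from the source.

```python
from math import ceil

def required_sample_size(N_h, s2_h, w_h, A, t=2.0, fpc=False):
    """Total sample size n for stratified random sampling.

    N_h  : stratum sizes
    s2_h : anticipated variances within the strata
    w_h  : predetermined shares of the sample per stratum (sum to 1)
    A    : allowed error, in the units of the mean
    t    : t-value for the chosen error probability
    fpc  : include the finite population correction term
    """
    N = sum(N_h)
    numerator = sum(Nh**2 * s2 / wh for Nh, s2, wh in zip(N_h, s2_h, w_h))
    denominator = N**2 * A**2 / t**2
    if fpc:
        denominator += sum(Nh * s2 for Nh, s2 in zip(N_h, s2_h))
    return ceil(numerator / denominator)   # round up to whole samples

# Hypothetical example with shares proportional to the stratum sizes.
N_h = [200, 300, 500]
s2_h = [400.0, 900.0, 1600.0]
w_h = [0.2, 0.3, 0.5]

print(required_sample_size(N_h, s2_h, w_h, A=12.0))             # 32
print(required_sample_size(N_h, s2_h, w_h, A=12.0, fpc=True))   # 31
```

The finite population correction can only lower the required sample size, because it adds a positive term to the denominator.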
Allocation of sample size to the strata
Three factors are relevant in respect to the decision of how to allocate the samples to the strata:
- Stratum sizes (The bigger the stratum the more samples)
- Variability inside the strata (The more variability the more samples)
- Inventory costs that might vary between strata (The more costly the fewer samples).
In case that all strata are of same size (equal area) and the variability is also equal, we can use
\[n_h = \frac {n}{L}\,\],
that is, a uniform allocation of samples to the strata. This situation is not realistic, however, and stratification would offer no advantage in this case.
If the number of samples per stratum should be determined proportional to the stratum size (e.g. area) we have a proportional allocation with:
\[n_h = n \frac {N_h}{N}\,\].
In this case one ignores possibly different variances within the strata. To take the variances into account, we definitely need some prior information about the conditions within the strata, which is sometimes available from case studies or forest management data. In this case one can apply the optimal (Neyman) allocation, in which the number of samples is proportional to the stratum size times the stratum standard deviation \(S_h\,\):
\[n_h = n \frac {N_h S_h}{\sum_{i=1}^L N_i S_i}\,\].
If inventory costs vary significantly between strata, this information can be included as an additional factor. With \(c_h\,\) denoting here the cost per sample in stratum \(h\,\) (not the stratum weight), the optimal allocation with cost minimization is:
\[n_h = n \frac {\frac {N_h S_h}{\sqrt {c_h}}}{\sum_{i=1}^L \frac{N_i S_i}{\sqrt {c_i}}}\,\]
- Note:
- It is obvious that we need \(n\,\) to calculate the allocation. At the same time we need \(n_h\,\) (the result of this calculation) to determine the sample size. This dilemma can be solved by an iterative process, where we predetermine relative shares (\(w_h\,\)) to derive \(n\,\) in a first step.
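The allocation schemes differ only in what the share of each stratum is made proportional to, so they can be sketched with one function that returns the relative shares \(w_h = n_h/n\). The sketch uses the stratum standard deviations \(S_h\) (the square roots of \(S_h^2\)), the form in which optimal (Neyman) allocation is usually stated; the function name, the example values and the cost figures are hypothetical.

```python
from math import sqrt

def allocation_weights(N_h, s_h=None, cost_h=None):
    """Relative shares w_h = n_h / n of the sample per stratum.

    Only N_h given        : proportional allocation
    N_h and s_h given     : optimal (Neyman) allocation
    N_h, s_h and cost_h   : optimal allocation with cost minimization
    """
    L = len(N_h)
    s_h = s_h or [1.0] * L          # equal variability if none is given
    cost_h = cost_h or [1.0] * L    # equal costs if none are given
    raw = [Nh * sh / sqrt(ch) for Nh, sh, ch in zip(N_h, s_h, cost_h)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical example.
N_h = [200, 300, 500]
s_h = [20.0, 30.0, 40.0]     # assumed standard deviations within the strata
cost_h = [1.0, 1.0, 4.0]     # assumed relative cost per sample

for w in (allocation_weights(N_h),
          allocation_weights(N_h, s_h),
          allocation_weights(N_h, s_h, cost_h)):
    print([round(x, 3) for x in w])
```

These shares can serve as the predetermined \(w_h\) in the sample size calculation. Multiplying them by \(n\) gives the \(n_h\), which then have to be rounded to whole numbers, so the rounded values may not add up exactly to \(n\).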
Summary
Depending on the allocation scheme to be used, one needs the following information to implement stratified random sampling:
- number of strata,
- size of strata or relative share on the total population,
- estimations for the variance inside the strata,
- where applicable, information about the expected inventory costs in the different strata.
Furthermore, one has to predefine the target precision \(A\,\) (the allowed error, expressed as half the width of the confidence interval), the error probability \(\alpha\), which is typically 0.05, and the corresponding t-value from the t-distribution (for sample sizes > 30 this value is approximately 2).
Based on this information one may
- choose an appropriate allocation scheme,
- derive the weights \(w_h\,\) for the strata,
- calculate the sample size,
- and allocate the samples to the strata.
Comments
Stratification is a powerful procedure for reducing the error variance if the preconditions mentioned above (homogeneous strata with significant differences between the stratum means) are fulfilled. In this case one takes a maximum of variability out of the sample. Some a priori information is necessary, as the subdivision of the population must be defined before the samples are taken. If this information is missing, one may apply post-stratification techniques such as double sampling for stratification. In this chapter stratified random sampling was introduced. However, stratification can also be combined with other sampling techniques. It is important to note that one may apply whatever sampling technique is appropriate within the different strata. This means, for example, that one can optimize the inventory design as well as the plot design according to the conditions within the strata.
- Example:
- A forest area can be subdivided into stands of different species or age classes. It is then possible to apply different plot designs (e.g. different sizes of sample plots) in the respective strata. While we may need larger plots for the old stands, smaller plots might be sufficient in younger and denser stands.
References
- [1] Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing, Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 p.
- [2] Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag, Frankfurt am Main. 193 p.
- [3] de Vries, P.G. 1986. Sampling Theory for Forest Inventory. A Teach-Yourself Course. Springer. 399 p.