Stratified sampling

From AWF-Wiki

Revision as of 21:13, 2 December 2008



Stratified sampling

Stratified sampling is not a sampling technique of its own but a procedure for subdividing a population into separate and more homogeneous sub-populations called strata (Kleinn 2007[1]). Its major characteristic is that independent sampling studies are carried out in each stratum; all strata are treated as sub-populations whose parameters need to be estimated. If random sampling is applied within the strata, the method is called stratified random sampling.

Stratified sampling is especially efficient when the variability within the strata is low and the differences between the stratum means are large (Akca 2001[2]). In this case a higher precision can be achieved with the same sample size.

Besides statistical issues, there are further arguments for stratification. The precondition for a meaningful partitioning of a population into non-overlapping strata is the availability of prior information that can be used as stratification criteria. In forest inventories this information might be available in the form of forest management or GIS data, or it can be derived from remote sensing data such as aerial photographs. From a statistical point of view, it is most efficient to stratify a population according to the target variable of the inventory. As this target variable is typically not known before the inventory, forest variables that are correlated with it are used as stratification criteria. In large managed forest areas, for example, age class might be a good stratification criterion if the volume per ha is to be estimated.

Arguments for stratification

Sometimes it is useful to subdivide the population of interest into a number of sub-populations (strata) and to carry out independent sampling in each of these strata. There are statistical as well as practical considerations that make this technique attractive for large-area forest inventories. It is not without reason that almost all national forest inventories are based on stratification.

Statistical justifications
  • The spatial distribution of sample points within the population is more even if these points are selected separately in each stratum,
  • It is possible to make an individual optimization of sampling and plot design for each stratum,
  • One usually increases the precision of the estimations for the total population,
  • Separate estimations for each of the strata are produced in a pre-planned manner,
  • It is guaranteed that there are actually sufficient observations in each one of the strata.
Practical justification
  • The possibility to optimize the inventory design separately for each stratum is very efficient and helps to minimize costs,
  • Inventory work (particularly field work) is facilitated: independent field campaigns can be carried out in each stratum,
  • It allows a better specialization of field crews (e.g. botanists).

This section is still under construction! This article was last modified on 12/2/2008. If you have comments please use the Discussion page or contribute to the article!


Stratification criteria

Very different stratification criteria can be used to partition a population. If the reason for stratification is not to improve the precision of the estimates, the stratification variables need not be correlated with the target variable. Under special conditions it might be useful to stratify even if there is no statistical justification. For example, a political boundary dividing a forest area might be a good reason for stratification, even if the forest is very homogeneous, if separate estimates are to be derived for both parts afterwards. Other examples of meaningful criteria are:

Geographical stratification
  • Ecozones,
  • Forest types,
  • Site and soil types,
  • Topographical conditions,
  • Political boundaries or properties,
  • ...

Furthermore, the expected inventory costs (in terms of time consumption) can be used as a criterion. These costs are typically correlated with the geographical conditions mentioned above. For example, it might be reasonable to stratify a forest area by slope classes if the time needed for field work differs significantly between flat terrain and steep slopes.

Subject matter stratification
  • Species,
  • Species groups (e.g. commercial / non-commercial),
  • Tree sociological classes,
  • Age classes in plantation forests,
  • ...

Statistics

The estimators for stratified random sampling are based on simple considerations about linear combinations (Kleinn 2007[1]). If we have two independent random variables \(Y_1\,\) and \(Y_2\,\) and are interested in their sum \(Y_1+Y_2\,\), then


\[E(Y_1+Y_2)=E(Y_1)+E(Y_2)\,\] and


\[var(Y_1+Y_2)=var(Y_1)+var(Y_2)\,\]


Simple:
The expected value of the sum equals the sum of the individual expected values. It is intuitively clear that we can sum up totals to derive an overall total. The situation is different if the variables are means!


If we consider \(Y_1\,\) and \(Y_2\,\) as estimations from the two strata 1 and 2, then we can apply the principles derived from these considerations for stratified sampling.

However, to calculate the overall mean from two estimations, we have to weight the single means in order to account for possibly different sizes of the sub-populations \(N_1\,\) and \(N_2\,\). If both strata are of the same size, we can calculate the mean by:

\[\frac 12 (Y_1+Y_2)=\frac 12 Y_1+\frac 12Y_2=c_1Y_1+c_2Y_2\,\].

The factor \(c_i\,\) can be interpreted as a weight for the single estimations from stratum 1 and 2. Because both are of equal size in this example \(c_1=c_2\) holds. A more typical case would be that we deal with strata of unequal sizes.


Example:
Weighting of partial results (or estimates) is important if they stem from sub-populations of different sizes and we want to calculate a mean. A simple example: you are to calculate the mean body weight of 50 students in a classroom. You have a mean value for the 15 women (55 kg) and a mean body weight for the 35 men (73 kg). The unweighted mean (64 kg) would be wrong. The correct value is 15/50*55+35/50*73=67.6 kg! The weights 15/50 and 35/50 express the share of the respective group in the total population. Under simple random sampling, this weight equals the probability of selecting a member of the respective group.
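The classroom example can be reproduced with a few lines of Python. This is only a minimal sketch; the function name `weighted_mean` is chosen here for illustration and does not come from the cited literature.

```python
def weighted_mean(means, sizes):
    """Mean of group means, each weighted by its share of the total size."""
    total = sum(sizes)
    return sum(size / total * mean for mean, size in zip(means, sizes))


group_means = [55.0, 73.0]  # mean body weight in kg (women, men)
group_sizes = [15, 35]      # number of students in each group

unweighted = sum(group_means) / len(group_means)
weighted = weighted_mean(group_means, group_sizes)

print("unweighted mean:", unweighted)
print("weighted mean:  ", round(weighted, 1))
```

The weights 15/50 and 35/50 sum to 1, which is the condition required of the weights \(c_i\) used for the estimators.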

The weights must be proportional to the size of the sub-populations. In case of forest inventories the population is typically an infinite number of dimensionless points (we are selecting sample points in a forest area) and we can describe the size of a sub-population by the area of all these points. The sum of all weights must be 1, so that:

\[\sum c_i=1\,\]

The expected value E for strata of different sizes is then:


\[E(c_1Y_1+c_2Y_2)=E(c_1Y_1)+E(c_2Y_2)=c_1E(Y_1)+c_2E(Y_2)\,\] , where \(c_1 \not= c_2\,\), or


\[E(\sum c_iY_i)=\sum c_iE(Y_i)\,\]


and the variance:

\[var(c_1Y_1+c_2Y_2)=var(c_1Y_1)+var(c_2Y_2)=c_1^2var(Y_1)+c_2^2var(Y_2)\,\] , or


\[var(\sum c_iY_i)=\sum c_i^2var(Y_i)\,\].


Attention:
If we multiply a random variable by a constant factor (here: the weight), its variance is multiplied by the square of that factor, because the variance is a squared measure!


Notation

Notation Meaning
\(L\,\) Number of strata \(h=1, ... , L \,\)
\(N\,\) Total population size
\(N_h\,\) Size of stratum \(h\,\) \((N=\sum N_h)\,\)
\(\bar y\,\) Estimated population mean
\(\bar y_h\,\) Estimated mean of stratum \(h\,\)
\(n\,\) Total sample size
\(n_h\,\) Sample size in stratum \(h\,\)
\(S^2_h\,\) Sample variance in stratum \(h\,\)
\(\tau\,\) Total
\(\tau_h\,\) Total in stratum \(h\,\)
\(\hat \tau_h\,\) Estimated total in stratum \(h\,\)
\(c_h\,\) Relative share of stratum \(h\,\) or weight of stratum
\(\hat {var} (\bar y)\,\) Estimated error variance
\(\hat {var} (\hat \tau)\,\) Estimated error variance of the total


Estimator for the mean

The estimator for the mean in stratified random sampling follows from the considerations above, analogous to the estimator for simple random sampling:


\[\bar y = \sum_{h=1}^L \frac{N_h}{N} \bar y_h = \frac {1}{N} \sum_{h=1}^L N_h \bar y_h\,\]
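As a minimal sketch, the estimator for the mean can be written as a short Python function. The stratum sizes and stratum means below are hypothetical values chosen only to illustrate the calculation.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Estimate the population mean as sum(N_h * ybar_h) / N."""
    N = sum(stratum_sizes)
    return sum(N_h * y_h for N_h, y_h in zip(stratum_sizes, stratum_means)) / N


# Hypothetical example: three strata with sizes N_h and estimated means ybar_h
N_h = [200, 300, 500]
ybar_h = [120.0, 250.0, 310.0]

print(stratified_mean(N_h, ybar_h))
```

Only the relative shares \(N_h/N\) matter for the mean, so the stratum sizes may equally be given as areas.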

Estimator for the variance

The estimator for the variance (selection without replacement) is:


\[\hat {var} (\bar y) = \sum_{h=1}^L \left\lbrace \left( \frac {N_h}{N} \right)^2 \hat {var} (\bar y_h) \right\rbrace = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {N_h-n_h}{N_h} \frac {S^2_h}{n_h}\].


In this case \((N_h-n_h)/N_h\,\) is a finite population correction, which is necessary if the strata are small and/or the sample size is large. It is typically applied if the sampling fraction, i.e. the ratio of sample size to population size, is larger than 0.05 (Akca 2001[2]).


Attention:
A finite population correction is important if we select without replacement and the population size is noticeably reduced by drawing the samples. As a consequence, the selection probabilities change with every sample we draw (because the remaining population decreases), which is corrected by this factor.

Without finite population correction the variance is:


\[\hat {var} (\bar y) = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {S^2_h}{n_h}\].
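The two variance estimators differ only in the finite population correction, which the following sketch makes explicit through a switch. The function name, the `fpc` parameter and all input values are assumptions made for illustration; the stratum sizes are taken to be counts of sampling units.

```python
def variance_of_stratified_mean(stratum_sizes, sample_sizes, sample_variances, fpc=True):
    """Estimated error variance of the stratified mean.

    With fpc=True each stratum term is multiplied by (N_h - n_h) / N_h.
    """
    N = sum(stratum_sizes)
    total = 0.0
    for N_h, n_h, s2_h in zip(stratum_sizes, sample_sizes, sample_variances):
        correction = (N_h - n_h) / N_h if fpc else 1.0
        total += N_h ** 2 * correction * s2_h / n_h
    return total / N ** 2


# Hypothetical example values
N_h = [200, 300, 500]          # stratum sizes (number of sampling units)
n_h = [10, 15, 25]             # sample sizes per stratum
s2_h = [400.0, 900.0, 1600.0]  # sample variances per stratum

with_fpc = variance_of_stratified_mean(N_h, n_h, s2_h, fpc=True)
without_fpc = variance_of_stratified_mean(N_h, n_h, s2_h, fpc=False)
print(with_fpc, without_fpc)
```

In this example every stratum has a sampling fraction of 0.05, so the corrected variance is 5% smaller than the uncorrected one.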

Estimator of the total

\[\hat\tau = N\bar y = \sum_{h=1}^L \hat \tau_h = \sum_{h=1}^L N_h \bar y_h\,\]


The variance of the total follows analogously:


\[\hat{var}(\hat {\tau}) = \hat{var}(N \bar y) = N^2 \hat{var}(\bar y)\]
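A short sketch of the estimator of the total and its variance, using hypothetical stratum sizes and means and an assumed error variance of the mean of 23.0. The function names are illustrative.

```python
def stratified_total(stratum_sizes, stratum_means):
    """Estimated total: the sum of the stratum totals N_h * ybar_h."""
    return sum(N_h * y_h for N_h, y_h in zip(stratum_sizes, stratum_means))


def variance_of_total(N, variance_of_mean):
    """The mean is scaled by N, so its variance is scaled by N squared."""
    return N ** 2 * variance_of_mean


# Hypothetical example values
N_h = [200, 300, 500]
ybar_h = [120.0, 250.0, 310.0]
var_mean = 23.0  # assumed estimated error variance of the mean

total = stratified_total(N_h, ybar_h)
var_total = variance_of_total(sum(N_h), var_mean)
print(total, var_total)
```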

Sample size

The necessary sample size always depends on the error probability, the allowed error and the variability within the population. In stratified sampling we also have to consider that the variance differs between strata. These different variances must be weighted to calculate the necessary sample size.


Note:
The "necessary" sample size is the estimated number of samples needed to obtain an estimate with a defined confidence interval. This interval is determined by the predetermined error probability \(\alpha\), the corresponding t-value from the t-distribution and the allowed error we define for the inventory. Typically this error is 10% for standard forest inventories.

Compare the following formula with the formula provided for simple random sampling:


\[n = \frac {t^2 \sum \frac {N^2_h S^2_h}{w_h}}{N^2 A^2}\,\],


where \(A\) is the allowed error and \(w_h = n_h/n\) is the share of samples that falls in stratum \(h\).


Note:
To calculate the overall sample size we apparently need to know the share of samples in the different strata. That sounds illogical, because the sample size is what we want to calculate here. However, since the sample size in each stratum influences the expected error, it is clear that we need this information. That is why the allocation scheme has to be defined before we start.
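The sample size formula can be sketched in Python as follows. Several things here are assumptions made for illustration: the input values are hypothetical, the allowed error is given in the same absolute units as the mean, the sample shares are taken as fractions that sum to 1, and t = 2.0 is used as a rough approximation of the t-value for \(\alpha = 0.05\).

```python
import math


def required_sample_size(stratum_sizes, stratum_variances, sample_shares,
                         allowed_error, t=2.0):
    """n = t^2 * sum(N_h^2 * S_h^2 / w_h) / (N^2 * A^2), rounded up."""
    N = sum(stratum_sizes)
    weighted_sum = sum(
        N_h ** 2 * s2_h / w_h
        for N_h, s2_h, w_h in zip(stratum_sizes, stratum_variances, sample_shares)
    )
    return math.ceil(t ** 2 * weighted_sum / (N ** 2 * allowed_error ** 2))


# Hypothetical example values
N_h = [200, 300, 500]
s2_h = [400.0, 900.0, 1600.0]
w_h = [0.2, 0.3, 0.5]  # shares of the sample per stratum, fixed in advance

n = required_sample_size(N_h, s2_h, w_h, allowed_error=12.0)
print(n)
```

The shares `w_h` have to be fixed before calling the function, which is exactly the point made in the note above: the allocation scheme comes first, the total sample size second.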

Allocation of samples to strata

Different criteria can be used to allocate the total sample size to the individual strata:

  • The size of a stratum (the larger, the more samples),
  • The variability within a stratum (the higher, the more samples),
  • The inventory costs, which may vary between strata (the higher, the fewer samples).

If all strata are of equal size (equal area shares) and the variability within the strata is the same,

\[n_h = \frac {n}{L}\,\],

that is, an equal allocation of the samples to the individual strata, can be used. As mentioned above, however, stratification would then offer no statistical advantage over unstratified sampling.

If the number of samples is to be proportional to the size of the sub-populations (e.g. their area), then:

\[n_h = n \frac {N_h}{N}\,\].

This scheme is called proportional allocation. It does not, however, account for the variability within the strata. To take the variability into account, prior information about the individual strata is required; information on the variance might be available, for example, from a pilot study. In this case the so-called Neyman allocation (also called optimal allocation) can be used, where \(S_h=\sqrt{S^2_h}\,\) is the standard deviation in stratum \(h\,\):

\[n_h = n \frac {N_h S_h}{\sum_{i=1}^L N_i S_i}\,\].

If inventory costs differ between strata (e.g. due to terrain conditions or stand density) and cost minimization is an objective of the study, the costs in the individual strata (\(c_h\,\), not to be confused with the weighting factors above!) can be included. This yields the optimal allocation with cost minimization:

\[n_h = n \frac {\frac {N_h S_h}{\sqrt {c_h}}}{\sum_{i=1}^L \frac{N_i S_i}{\sqrt {c_i}}}\,\]


Note:
Here it becomes apparent that \(n\) is needed to compute the allocation. At the same time, \(n_h\), i.e. the result of this computation, is needed to derive the total sample size. This dilemma can only be resolved by an iterative procedure: first, relative shares for the strata are specified (e.g. based on area), and in the next step \(n\) is calculated.
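The four allocation schemes can be compared in a small Python sketch. All input values are hypothetical. The Neyman and cost-optimal functions use the stratum standard deviation \(S_h\) (the square root of the variance), as in the standard form of these allocations, and the results are left unrounded so that they sum exactly to \(n\).

```python
import math


def equal_allocation(n, num_strata):
    """Same number of samples in every stratum."""
    return [n / num_strata] * num_strata


def proportional_allocation(n, sizes):
    """Samples proportional to the stratum size N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]


def neyman_allocation(n, sizes, std_devs):
    """Samples proportional to N_h * S_h."""
    products = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(products)
    return [n * p / total for p in products]


def cost_optimal_allocation(n, sizes, std_devs, costs):
    """Samples proportional to N_h * S_h / sqrt(c_h)."""
    terms = [N_h * S_h / math.sqrt(c_h)
             for N_h, S_h, c_h in zip(sizes, std_devs, costs)]
    total = sum(terms)
    return [n * term / total for term in terms]


# Hypothetical example values
sizes = [200, 300, 500]       # N_h
std_devs = [20.0, 30.0, 40.0] # S_h
costs = [1.0, 4.0, 4.0]       # relative cost per sample, c_h
n = 100

for name, allocation in [
    ("equal", equal_allocation(n, len(sizes))),
    ("proportional", proportional_allocation(n, sizes)),
    ("Neyman", neyman_allocation(n, sizes, std_devs)),
    ("cost-optimal", cost_optimal_allocation(n, sizes, std_devs, costs)),
]:
    print(name, [round(x, 1) for x in allocation])
```

In practice the fractional results have to be rounded to whole plots, which may change the total by a sample or two.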

Practical implementation

Depending on which allocation scheme is to be used, the following information is needed for stratification:

  • Number of strata,
  • Size or relative share of each stratum in the population,
  • Estimates of the variance within the individual strata,
  • Prior information on the expected survey costs (e.g. in terms of time required) in the strata.

Furthermore, as in any inventory, the precision (A) for the overall mean must be specified. The error probability is generally set to \(\alpha = 0.05\,\).

Based on the available information, one can then

  • choose an appropriate allocation scheme,
  • calculate the weights \(w_h\) for the individual strata,
  • derive the total sample size, and
  • determine the number of samples for each stratum.
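The four steps listed above can be sketched as one small planning function. This is only an illustration under stated assumptions: proportional allocation is chosen as the scheme, t = 2.0 approximates the t-value for \(\alpha = 0.05\), the allowed error is given in absolute units, and the stratum values are hypothetical.

```python
import math


def plan_stratified_sample(stratum_sizes, stratum_variances, allowed_error, t=2.0):
    """Return (weights w_h, total sample size n, per-stratum sample sizes n_h)."""
    N = sum(stratum_sizes)

    # Steps 1 and 2: choose proportional allocation and derive the weights w_h
    weights = [N_h / N for N_h in stratum_sizes]

    # Step 3: total sample size from the formula in the section "Sample size"
    weighted_sum = sum(
        N_h ** 2 * s2_h / w_h
        for N_h, s2_h, w_h in zip(stratum_sizes, stratum_variances, weights)
    )
    n = math.ceil(t ** 2 * weighted_sum / (N ** 2 * allowed_error ** 2))

    # Step 4: samples per stratum, rounded up to whole plots
    per_stratum = [math.ceil(n * w_h) for w_h in weights]
    return weights, n, per_stratum


# Hypothetical example values
weights, n, per_stratum = plan_stratified_sample(
    [200, 300, 500], [400.0, 900.0, 1600.0], allowed_error=12.0
)
print(weights, n, per_stratum)
```

Because each stratum is rounded up separately, the per-stratum sample sizes can add up to slightly more than \(n\).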

Comments

As already mentioned, dividing a population into strata is particularly useful if it yields more homogeneous sub-populations, i.e. if the variability within the strata is lower than in the whole population and the differences between the strata are as large as possible. The ratio between these variances naturally also depends on the number of strata: the more strata are formed, the smaller the differences between them become. Experience shows that forming more than 6 strata is not useful, because the method then loses efficiency.

Prior information is absolutely necessary to carry out a stratification. It can partly be derived from forest management planning data or from remote sensing information. The area of different stand types, for example, can be obtained by delineation on aerial photographs where the boundaries are clearly visible. The great advantage of this approach is certainly that the individual strata can be treated independently. For example, completely different inventory designs as well as plot designs can be used, each optimized independently for the specific conditions.


Example:
A forest area consists of delineated age classes whose areas are used for stratification. It is then possible to use smaller circular sample plots in young and dense stands than in the older stands of another stratum. Likewise, the sampling density can be adapted to the variability.

If the area (or another stratification criterion) is not known beforehand, this information can also be estimated from a sample. This procedure is called "double sampling for stratification".

References

  1. Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing, Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 pp.
  2. Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag, Frankfurt am Main. 193 pp.

