Stratified sampling

From AWF-Wiki
(Difference between revisions)
(Arguments for stratification)
(Estimator of the total)
 
(141 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{Languages}}
+
{{Languages}}{{Ficontent}}
 +
Stratified sampling is actually not a new [[lectuenotes:Sampling design and plot design|sampling design]] of its own, but a procedural method to subdivide a [[population]] into separate and more homogeneous sub-populations called [[strata]] (Kleinn 2007<ref name="kleinn2007">Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing. Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 S.</ref>). In some situations it is useful from a statistical point of view, or required for practical and organizational reasons, to subdivide the population into different strata.
 +
The major characteristic is that independent sampling studies are carried out in each stratum where all strata are considered as sub-populations of which the parameters need to be estimated. If [[Simple random sampling]] is applied, we call that '''stratified random sampling'''.
 +
The only difference between random sampling and stratified random sampling is that the latter consists of several independent sampling studies. The only additional consideration is how to combine the estimates from the individual strata in order to produce estimates for the total population.
  
 +
Stratified sampling is especially efficient in those cases where the [[variability]] inside the strata is low and the differences of means between the strata are large (Akca 2001<ref name="akca">Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag. Frankfurt am Main, 193 S.</ref>). In this case we can achieve a higher [[accuracy and precision|precision]] with the same [[sample size]]. Besides statistical issues there are further arguments for stratification.
  
 +
We can distinguish two general approaches for stratification: the so-called pre-stratification, in which strata are formed '''''before''''' the sampling study starts, and post-stratification, where strata are generated in the course of the sampling or even afterwards, based on the data. In the first case, which is described in this article, the strata must be defined and, in the case of geographical strata, delineated to define the [[Population|sampling frame]].
  
__TOC__
+
The precondition for a meaningful partitioning of a population into non-overlapping strata is the availability of prior information that can be used as stratification criteria (de Vries 1986<ref>de Vries, P.G., 1986. Sampling Theory for Forest Inventory. A Teach-Yourself Course. Springer. 399 p.</ref>). In forest inventories this information might be available in the form of forest management or GIS data, or it can be derived from remote sensing data such as aerial photos. From a statistical point of view, the most efficient approach is to stratify the population proportional to the target value of the inventory. As this target value is typically not known before the inventory, forest variables that are correlated with it are used as stratification criteria. In large managed forest areas, age classes or forest types might, for example, be good stratification criteria if the estimation of [[volume per ha]] is targeted. 
==Stratified sampling==
+
  
Stratified sampling is actually not a new [[lectuenotes:Sampling techniques| Sampling technique]] of its own, but a procedural method to subdivide a population into separate and more homogeneous sub-populations called strata (Kleinn 2007<ref name="kleinn2007">Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing. Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 S.</ref>). The major characteristic is that independent sampling studies are carried out in each stratum where all strata are considered as sub-populations of which the parameters need to be estimated. If [[Lecturenotes:Simple random sampling|random sampling]] is applied, we call that '''stratified random sampling'''.
 
 
Stratified sampling is especially efficient in those cases where the [[Lecturenotes:variability|variability]] inside the strata is low and the differences of means between the strata are large (Akca 2001<ref name="akca">Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag. Frankfurt am Main, 193 S.</ref>). In this case we can achieve a higher [[Lecturenotes:precision|precision]] with the same [[Lecturenotes:sample size|sample size]].
 
 
Besides statistical issues there are further arguments for stratification. The precondition for a meaningful partitioning of a population into non-overlapping strata is the availability of prior information that can be used as stratification criteria. In forest inventories this information might be available in the form of forest management or GIS data, or it can be derived from remote sensing data such as aerial photos. From a statistical point of view, the most efficient approach is to stratify the population proportional to the target value of the inventory. As this target value is typically not known before the inventory, forest variables that are correlated with it are used as stratification criteria. In large managed forest areas the age class might, for example, be a good stratification criterion if the estimation of [[volume per ha]] is targeted. 
 
  
 
===Arguments for stratification===
 
===Arguments for stratification===
Sometimes it is useful to subdivide the population of interest into a number of sub-populations ([[Lecturenotes:stratum|strata]]) and carry out independent sampling in each of these strata. There are statistical as well as practical considerations that make this technique very favorable and interesting for large area forest inventories. Not without reason, almost all [[national forest invetory|national forest inventories]] are based on stratification.
+
Sometimes it is useful to subdivide the [[population]] of interest into a number of sub-populations and carry out independent sampling in each of these strata. There are statistical as well as practical considerations that make this technique very favorable and interesting for large area forest inventories. Not without reason, almost all [[national forest invetories|national forest inventories]] are based on stratification.
  
 
;Statistical justifications:
 
;Statistical justifications:
*The spatial distribution of [[Lecturenotes:Sample point|sample point]]s inside the population is more even if these points are selected within the individual strata,
+
*The spatial distribution of [[Sample point|sample point]]s inside the population is more even if these points are selected within the individual strata,
*It is possible to make an individual optimization of sampling and plot design for each stratum,
+
*It is possible to make an individual optimization of [[:Category:Sampling design|sampling]] and [[:category:plot design|plot design]] for each stratum,
*One usually increases the precision of the estimations for the total population,
+
*One usually increases the [[accuracy and precision|precision]] of the estimations for the total population,
 
*Separate estimations for each of the strata are produced in a pre-planned manner,
 
*Separate estimations for each of the strata are produced in a pre-planned manner,
*It is guaranteed that there are actually sufficient observations in each one of the strata.
+
*It is guaranteed that there are actually sufficient observations in each one of the strata,
 +
*It is possible to produce estimations with defined precision level for sub-populations.
  
 
;Practical justification:
 
;Practical justification:
*The possibility to optimize the [[Lecturenotes:inventory design|Inventory design]] separately for each stratum is very efficient and helps to minimize costs,
+
*The possibility to optimize the [[Sampling design and plot design|inventory design]] separately for each stratum is very efficient and helps to minimize costs,
*To facilitate inventory work (particularly field work): independent [[Lecturenotes:field campaigns|field campaigns]] can be carried out in each stratum,
+
*To facilitate inventory work (particularly field work): independent [[field campaign|field campaigns]] can be carried out in each stratum,
*It allows a better specialization of field crews (e.g. botanists).
+
*It allows a better specialization of field crews (e.g. botanists).
 
+
{{construction}}
+
  
 
===Stratification criteria===
 
===Stratification criteria===
  
Various criteria can be used as stratification variables when forming strata. If the reason for stratification is not to increase the precision of the estimate, these need not be correlated with the target variable in every case. Under certain circumstances it is sensible to partition the population even if no appreciable statistical improvement of the estimate results. This is the case, for example, when political boundaries impose a spatial division of forest areas because inventory results are required for each individual region. Areas that are homogeneous in themselves may then also be considered separately. Other conceivable stratification criteria include:
+
Very different stratification criteria are conceivable for partitioning a population. If the reason for stratification is not to improve the [[accuracy and precision|precision]] of the estimates, the stratification variables need not be [[correlation|correlated]] with the target value. Under special conditions it might be useful to stratify even if there is no statistical justification. For example, a political boundary dividing a forest area may be a good reason for stratification, even if the forest is very homogeneous, when separate estimates are required for both parts. Other examples of meaningful criteria are:
*Topographical conditions (e.g. elevation zones),
+
 
*different [[Lecturenotes:Stand type/de|stand type]]s,
+
;Geographical stratification:
*[[Lecturenotes:Age class/de|Age class]]es (not in natural forests)
+
*Eco zones,
*Soil types, nutrient supply,
+
*[[Forest types]],
*Growth regions,
+
*Site and soil types,
*Tree species,
+
*Topographical conditions,
 +
*Political boundaries or properties,
 
*...
 
*...
  
Inventory costs can also be taken into account as a stratification criterion. These are normally correlated with the criteria listed above. For example, stratification by slope classes is conceivable if the cost (time required) of field work varies considerably between strata. If survey costs are to be considered when distributing samples among the strata, this may lead to a different result than a simple distribution ([[Lecturenotes:Allocation of sample points/de|allocation]]) of the [[Lecturenotes:Sample point/de|sample points]].
+
Furthermore, the expected inventory costs (in terms of time consumption) can be used as a criterion. These costs are typically correlated with the geographical conditions mentioned above. For example, it may be reasonable to stratify a forest area by slope classes if the time required for field work differs significantly between flat terrain and steep slopes.
  
==Statistics==
+
;Subject matter stratification
 +
*Species,
 +
*Species groups (e.g. commercial / non-commercial),
 +
*Tree sociological classes,
 +
*Age classes in plantation forests,
 +
*...
  
 +
==Statistics==
  
The [[Lecturenotes:Estimator/de|estimators]] for stratified sampling are based on simple linear combinations (Kleinn 2007<ref name="kleinn2007">Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing. Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 S.</ref>). Suppose we consider two independent random variables <math>Y_1\,</math> and <math>Y_2\,</math> and are interested in their sum <math>Y_1+Y_2\,</math>, then
+
The only new concept that needs to be introduced in stratified random sampling is how to combine the estimates derived for different strata.
 
+
The [[Estimator|estimator]]s for stratified random sampling are based on simple considerations about [[linear combination]]s (Kleinn 2007<ref name="kleinn2007">Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing. Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 S.</ref>).  
 
+
When we have two '''''independent''''' random variables <math>Y_1\,</math> and <math>Y_2\,</math> and if we are interested in the sum of the two <math>Y_1+Y_2\,</math>, then
:<math>E(Y_1+Y_2)=E(Y_1)+E(Y_2)\,</math> and
+
  
 +
:<math>E(Y_1+Y_2)=E(Y_1)+E(Y_2)\,</math> and
  
 
:<math>var(Y_1+Y_2)=var(Y_1)+var(Y_2)\,</math>
 
:<math>var(Y_1+Y_2)=var(Y_1)+var(Y_2)\,</math>
  
 
{{info
 
{{info
|message=Simple:
+
|message=Simple:
|text=The expected value E of the sum of both variables equals the sum of the individual expected values. It seems logical that we simply add up the individual totals to obtain an overall total. The situation is different for means.
+
|text=The [[expected value]] for the sum of both is equal to the sum of both single expected values. It is intuitively clear that we can sum up the totals to derive an overall total. It is different if we consider the variables to be means! In this case we have to weight the single means according to the size of the strata.
 
}}
 
}}
  
 +
If we consider <math>Y_1\,</math> and <math>Y_2\,</math> as estimations from the two strata 1 and 2, then we can apply the principles derived from these considerations for stratified sampling.
  
If <math>Y_1\,</math> and <math>Y_2\,</math> are estimates from the two strata 1 and 2, we can use these principles for stratified sampling.
+
However, to calculate the overall [[Mean|mean]] from two estimations, we have to weight the single means in order to account for possibly different sizes of the sub-populations <math>N_1\,</math> and <math>N_2\,</math>. If both strata are of the '''same size''', we can calculate the mean by:
 
+
If the target quantity to be estimated is, for example, a [[Lecturenotes:Mean/de|mean]] (e.g. mean volume per ha), we must bear in mind that the strata may be of unequal size. For strata of '''equal size''' we have:
+
  
 
:<math>\frac 12 (Y_1+Y_2)=\frac 12 Y_1+\frac 12Y_2=c_1Y_1+c_2Y_2\,</math>
 
:<math>\frac 12 (Y_1+Y_2)=\frac 12 Y_1+\frac 12Y_2=c_1Y_1+c_2Y_2\,</math>
  
The factor <math>c\,</math> can be regarded as a weighting factor for the individual estimates from strata 1 and 2. Since both strata are of the same size here, <math>c_1=c_2\,</math>.
+
The factor <math>c_i\,</math> can be interpreted as a weight for the single estimations from stratum 1 and 2. Because both are of equal size in this example <math>c_1=c_2\,</math> holds.
 
+
A more typical case would be that we deal with strata of unequal sizes.
The usual case is rather that the strata are of unequal size. We therefore have to weight the estimates from the individual strata differently.
+
  
 
{{info
 
{{info
|message=Example:
+
|message=Example:
|text=Weighting individual partial results (or estimates) is important whenever they come from sub-populations of different sizes and an overall mean is to be calculated from them. A simple example: the mean body weight of 50 students is to be determined. A mean body weight was calculated for the 15 women (55 kg) and for the 35 men (73 kg). Calculating an unweighted mean over both groups (64 kg) would be wrong. The correct value is 15/50*55+35/50*73=67.6 kg! The weights 15/50 and 35/50 express the share of each group in the total population.
+
|text=Weighting individual partial results (or estimates) is important if they stem from sub-populations of different sizes and we want to calculate an overall mean. A simple example: you are asked to calculate the mean body weight of 50 students in a classroom. You have a mean value for the 15 women (55 kg) and a mean body weight for the 35 men (73 kg). An unweighted mean (64 kg) would be wrong. The correct value is 15/50*55+35/50*73=67.6 kg! The weights 15/50 and 35/50 express the share of the respective group in the total population. This weight is equal to the selection probability for simple random sampling.
 
}}
 
}}
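The classroom example above can be reproduced with a few lines of Python. This is only a minimal sketch; the function name <code>weighted_mean</code> is an assumption made for illustration and does not come from the lecture notes.

```python
def weighted_mean(means, sizes):
    """Combine group means with weights proportional to the group sizes."""
    total = sum(sizes)
    weights = [size / total for size in sizes]  # c_i = N_i / N, the weights sum to 1
    return sum(c * m for c, m in zip(weights, means))


group_means = [55.0, 73.0]  # mean body weight in kg: women, men
group_sizes = [15, 35]      # number of students in each group

unweighted = sum(group_means) / len(group_means)    # ignores the group sizes
weighted = weighted_mean(group_means, group_sizes)  # accounts for the group sizes

print(f"unweighted mean: {unweighted:.1f} kg")
print(f"weighted mean:   {weighted:.1f} kg")
```

The unweighted value treats both groups as if they were equally large, which is why it differs from the weighted one.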
  
The weights must be proportional to the size of the sub-populations in the respective strata. In inventories the population in most cases consists of an infinite number of possible sample points, whose size we express through the area of the individual strata. The sum of the individual weights must be 1, so that:
+
 
 +
The weights must be proportional to the size of the sub-populations. In case of forest inventories the population is typically the total forest area (we are selecting [[sample point]]s from this continuum), so that the individual stratum areas can be used to derive weighting factors. The sum of all weights must be 1, so that:
  
 
:<math>\sum c_i=1\,</math>
 
:<math>\sum c_i=1\,</math>
 +
The expected value E for strata of '''different sizes''' is then:
  
The expected value E for strata of '''unequal size''' is therefore:
+
:<math>E(c_1Y_1+c_2Y_2)=E(c_1Y_1)+E(c_2Y_2)=c_1E(Y_1)+c_2E(Y_2)\,</math> , where <math>c_1 \not= c_2\,</math> or
  
 +
:<math>E(\sum c_iY_i)=\sum c_iE(Y_i)\,</math>
  
:<math>E(c_1Y_1+c_2Y_2)=E(c_1Y_1)+E(c_2Y_2)=c_1E(Y_1)+c_2E(Y_2)\,</math> , where <math>c_1 \not= c_2\,</math>, or
+
and the variance:
  
 +
:<math>var(c_1Y_1+c_2Y_2)=var(c_1Y_1)+var(c_2Y_2)=c_1^2var(Y_1)+c_2^2var(Y_2)\,</math>  or
  
:<math>E(\sum c_iY_i)=\sum c_iE(Y_i)\,</math>.
+
:<math>var(\sum c_iY_i)=\sum c_i^2var(Y_i)\,</math>
 
+
 
+
 
+
Analogously, the variance is:
+
 
+
:<math>var(c_1Y_1+c_2Y_2)=var(c_1Y_1)+var(c_2Y_2)=c_1^2var(Y_1)+c_2^2var(Y_2)\,</math> , or
+
 
+
 
+
:<math>var(\sum c_iY_i)=\sum c_i^2var(Y_i)\,</math>.
+
  
 
{{info
 
{{info
|message=Note:
+
|message=Note:
|text=Whenever a variance is scaled (or, as here, put into proportion by a weighting factor), the factor must be squared, because variance is a squared quantity!
+
|text=If a random variable is multiplied by a constant factor (here, a weight), this factor must be squared when it is taken out of the variance, because variance is a squared measure!
 
}}
 
}}
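The rules for the expected value and the variance of a weighted sum of independent variables can be checked exactly on a small example. The sketch below assumes two hypothetical discrete variables (a fair coin and a fair die) and arbitrary weights 0.3 and 0.7; none of these values come from the lecture notes. Exact fractions are used so that both sides of each identity can be compared without rounding.

```python
from fractions import Fraction
from itertools import product


def expectation(dist):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(p * y for y, p in dist.items())


def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = expectation(dist)
    return sum(p * (y - mu) ** 2 for y, p in dist.items())


def weighted_sum_dist(dist1, dist2, c1, c2):
    """Distribution of c1*Y1 + c2*Y2 for independent Y1 and Y2."""
    out = {}
    for (y1, p1), (y2, p2) in product(dist1.items(), dist2.items()):
        value = c1 * y1 + c2 * y2
        out[value] = out.get(value, 0) + p1 * p2  # independence: joint probability = p1 * p2
    return out


# Hypothetical variables: Y1 is a fair coin (0 or 1), Y2 is a fair die (1..6).
y1 = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}
y2 = {Fraction(k): Fraction(1, 6) for k in range(1, 7)}
c1, c2 = Fraction(3, 10), Fraction(7, 10)  # weights sum to 1

combined = weighted_sum_dist(y1, y2, c1, c2)

lhs_mean = expectation(combined)
rhs_mean = c1 * expectation(y1) + c2 * expectation(y2)
lhs_var = variance(combined)
rhs_var = c1 ** 2 * variance(y1) + c2 ** 2 * variance(y2)  # weights enter squared

print(lhs_mean == rhs_mean, lhs_var == rhs_var)
```

The variance identity relies on the independence of the two variables; for dependent variables a covariance term would have to be added.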
 
  
 
===Notation===
 
===Notation===
  
{| class="wikitable"
+
{|
 
|-
 
|-
! Notation !! Meaning
+
! Notation !! Meaning
 
|-
 
|-
| <math>L\,</math> || Number of strata <math>h=1, ... , L \,</math> ||
+
| <math>L\,</math> || Number of strata <math>h=1, ... , L \,</math> ||
 
|-
 
|-
| <math>N\,</math> || Total population size ||
+
| <math>N\,</math> || Total population size ||
 
|-
 
|-
| <math>N_h\,</math> || Size of stratum <math>h (N=\sum N_h)\,</math> ||
+
| <math>N_h\,</math> || Size of stratum <math>h (N=\sum N_h)\,</math> ||
 
|-
 
|-
| <math>\bar y\,</math> || Estimated population mean ||
+
| <math>\bar y\,</math> || Estimated population mean ||
 
|-
 
|-
| <math>\bar y_h\,</math> || Estimated mean of stratum <math>h\,</math> ||
+
| <math>\bar y_h\,</math> || Estimated mean of stratum <math>h\,</math> ||
 
|-
 
|-
| <math>n\,</math> || Sample size ||
+
| <math>n\,</math> || Total sample size ||
 
|-
 
|-
| <math>n_h\,</math> ||  Sample size in stratum <math>h\,</math>||
+
| <math>n_h\,</math> ||  Sample size in stratum <math>h\,</math>||
 
|-
 
|-
| <math>S^2_h\,</math> || Sample variance in stratum <math>h\,</math> ||
+
| <math>S^2_h\,</math> || Sample variance in stratum <math>h\,</math> ||
 
|-
 
|-
 
| <math>\tau\,</math> || Total ||
 
| <math>\tau\,</math> || Total ||
 
|-
 
|-
| <math>\tau_h\,</math> || Total in stratum <math>h\,</math> ||
+
| <math>\tau_h\,</math> || Total in stratum <math>h\,</math> ||
 
|-
 
|-
| <math>\hat \tau_h\,</math> || Estimated total in stratum <math>h\,</math> ||
+
| <math>\hat \tau_h\,</math> || Estimated total in stratum <math>h\,</math> ||
 
|-
 
|-
| <math>c_h\,</math> || Relative share of stratum <math>h\,</math>, i.e. the weight of the stratum ||
+
| <math>c_h\,</math> || Relative share of stratum <math>h\,</math> or weight of stratum ||
 
|-
 
|-
| <math>\hat {var} (\bar y)\,</math> || Estimated error variance of the estimated population mean ||
+
| <math>\hat {var} (\bar y)\,</math> || Estimated error variance of the estimated population mean ||
 
|-
 
|-
| <math>\hat {var} (\hat \tau)\,</math> || Estimated error variance of the total ||
+
| <math>\hat {var} (\hat \tau)\,</math> || Estimated error variance of the total ||
 +
|-
 +
| <math>k_h\,</math> || Estimated costs for the assessment in stratum <math>h\,</math> ||
 
|}
 
|}
  
 +
===Estimator for the mean===
  
===Estimator of the mean===
+
The estimator of the mean for stratified random sampling follows from the considerations above, analogous to the estimator for [[Simple random sampling|simple random sampling]], as
 
+
The estimator of the mean for stratified sampling follows analogously from the considerations above (and on the basis of the [[Lecturenotes:Simple random sampling/de|simple random sampling estimators]] already presented) as:
+
  
 
:<math>\bar y = \sum_{h=1}^L \frac{N_h}{N} \bar y_h  =  \frac {1}{N} \sum_{h=1}^L N_h \bar y_h\,</math>
 
:<math>\bar y = \sum_{h=1}^L \frac{N_h}{N} \bar y_h  =  \frac {1}{N} \sum_{h=1}^L N_h \bar y_h\,</math>
  
===Variance estimator===
+
===Estimator for the error variance===
 
+
The variance estimator for selection without replacement can be derived as follows:
+
  
:<math>\hat {var} (\bar y) = \sum_{h=1}^L \left\lbrace \left( \frac {N_h}{N} \right)^2 \hat {var} (\bar y_h) \right\rbrace  = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {N_h-n_h}{N_h} \frac {S^2_h}{n_h}</math>.
+
The estimator for the error variance (selection without replacement) is:
  
 +
:<math>\hat {var} (\bar y) = \sum_{h=1}^L \left\lbrace \left( \frac {N_h}{N} \right)^2 \hat {var} (\bar y_h) \right\rbrace  = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {N_h-n_h}{N_h} \frac {S^2_h}{n_h}</math>
  
Here <math>(N_h-n_h)/N_h\,</math> is a [[Lecturenotes:Infinit population correction/de|finite population correction]], which is only used when the strata are small or the ratio of sample size to population size is greater than 0.05 (Akca 2001<ref name="akca">Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag. Frankfurt am Main, 193 S.</ref>).
+
Here <math>(N_h-n_h)/N_h\,</math> is the [[finit population correction|finite population correction]], which is necessary if the strata are small and/or the sample size is large. It is typically applied if the sampling fraction <math>n_h/N_h\,</math> is larger than 0.05 (Akca 2001<ref name="akca">Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag. Frankfurt am Main, 193 S.</ref>).
  
 
{{info
 
{{info
|message=Note:
+
|message=Note:
|text=A finite population correction is needed whenever sampling is done without replacement and the population size is reduced to a notable extent by drawing the sample. The selection probabilities then change with every sample element drawn, which the finite population correction compensates for.
+
|text=A finite population correction is important if we select without replacement and the population size is noticeably reduced by drawing the samples. As a consequence, the selection probabilities change with every sample we draw (because the remaining population decreases), which is corrected by this factor.
 
}}
 
}}
  
Without the finite population correction we therefore obtain:
 
  
 +
This estimator looks complex, but can be read as the weighted linear combination of simple random sampling estimators applied to the strata. The weighting factor is in this case
  
:<math>\hat {var} (\bar y) = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {S^2_h}{n_h}</math>.
+
:<math>c_h^2=\frac {N_h^2}{N^2}\,</math>
  
===Estimator of the total===
+
Without finite population correction the error variance is:
  
:<math>\hat\tau = N\bar y = \sum_{h=1}^L \frac {N_h}{N} \hat \tau_h = \sum_{h=1}^L N_h \bar y_h\,</math>
+
:<math>\hat {var} (\bar y) = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {S^2_h}{n_h}</math>
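The estimators of the mean and of its error variance can be sketched in Python as follows. The function names and the two-stratum input values are assumptions chosen for illustration only; the formulas are the ones given above, with the finite population correction <math>(N_h-n_h)/N_h</math> applied optionally.

```python
def stratified_mean(N_h, ybar_h):
    """Estimated population mean: stratum means weighted by N_h / N."""
    N = sum(N_h)
    return sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N


def stratified_mean_variance(N_h, n_h, s2_h, use_fpc=True):
    """Estimated error variance of the stratified mean.

    Each stratum contributes N_h^2 * fpc * S_h^2 / n_h, divided by N^2 at the end.
    """
    N = sum(N_h)
    total = 0.0
    for Nh, nh, s2 in zip(N_h, n_h, s2_h):
        fpc = (Nh - nh) / Nh if use_fpc else 1.0  # finite population correction
        total += Nh ** 2 * fpc * s2 / nh
    return total / N ** 2


# Hypothetical inventory with two strata (values invented for illustration).
N_h = [600, 400]         # stratum sizes
n_h = [30, 20]           # sample sizes per stratum
ybar_h = [250.0, 120.0]  # estimated stratum means
s2_h = [900.0, 400.0]    # sample variances per stratum

mean = stratified_mean(N_h, ybar_h)
var_fpc = stratified_mean_variance(N_h, n_h, s2_h, use_fpc=True)
var_no_fpc = stratified_mean_variance(N_h, n_h, s2_h, use_fpc=False)

print(mean, var_fpc, var_no_fpc)
```

Because the correction factor is never larger than 1, the variance with correction cannot exceed the variance without it.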
  
 +
===Estimator of the total===
  
The variance of the total is accordingly:
+
:<math>\hat\tau = N\bar y = \sum_{h=1}^L \hat \tau_h = \sum_{h=1}^L N_h \bar y_h\,</math>
  
 +
The estimated error variance of the estimated population total then follows as:
  
 
:<math>\hat{var}(\hat {\tau}) = \hat{var}(N \bar y) = N^2 \hat{var}(\bar y)</math>
 
:<math>\hat{var}(\hat {\tau}) = \hat{var}(N \bar y) = N^2 \hat{var}(\bar y)</math>
  
==Sample size==
+
To calculate the [[confidence interval]] for the estimate we need the standard error and the number of [[dergees of freedom|degrees of freedom]]. In stratified random sampling the latter poses a difficulty. While the number of degrees of freedom is a direct function of sample size (<math>DF=n-1\,</math>) in simple random sampling, here we combine <math>L\,</math> different stratum sample sizes. If the variances differ significantly among the strata, it is not possible to simply add up the degrees of freedom of the individual strata. This statistical problem is known as the '''''Behrens-Fisher problem''''' or '''''Welch-Satterthwaite problem'''''. These statisticians proposed an approximation formula for the "effective number of degrees of freedom", so that the t-statistic can be applied approximately correctly:
  
When deriving the required [[Lecturenotes:Total sample size/de|sample size]], which is always influenced by the specified allowable error, the statistical confidence level and the variability within the population, it must be borne in mind that the variance differs between the individual strata.
+
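:<math>DF_e=\frac {\left(\sum_{h=1}^L g_h S_h^2 \right)^2}{\sum_{h=1}^L \frac {g_h^2 S_h^4}{n_h-1}}\,</math>
where <math>g_h = N_h (N_h-n_h)/n_h\,</math>, so that <math>\sum g_h S_h^2\,</math> is the estimated error variance of the total given above.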
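A minimal Python sketch of this approximation is given below. It assumes that <math>g_h = N_h (N_h-n_h)/n_h</math>, which makes <math>\sum g_h S_h^2</math> equal to the estimated error variance of the total; the function name and the input values are hypothetical.

```python
def effective_degrees_of_freedom(N_h, n_h, s2_h):
    """Satterthwaite-type approximation of the effective degrees of freedom.

    Assumes g_h = N_h * (N_h - n_h) / n_h, so that sum(g_h * S_h^2) is the
    estimated error variance of the stratified total.
    """
    g = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]
    numerator = sum(gh * s2 for gh, s2 in zip(g, s2_h)) ** 2
    denominator = sum(
        gh ** 2 * s2 ** 2 / (nh - 1) for gh, s2, nh in zip(g, s2_h, n_h)
    )
    return numerator / denominator


# Hypothetical two-stratum example (values invented for illustration).
N_h = [600, 400]
n_h = [30, 20]
s2_h = [900.0, 400.0]

df_e = effective_degrees_of_freedom(N_h, n_h, s2_h)

# Balanced case: identical strata give the pooled value sum(n_h - 1).
df_balanced = effective_degrees_of_freedom([100, 100], [10, 10], [4.0, 4.0])

print(round(df_e, 2), df_balanced)
```

The result always lies between the smallest value of <math>n_h-1</math> and the sum of all <math>n_h-1</math>; the more unequal the stratum contributions, the closer it moves to the lower bound.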
These different variances must therefore enter (weighted) into the calculation of the required sample size.
+
 
 +
==Sample size==
 +
 
 +
To determine the necessary [[sample size|sample size]], which always depends on the error probability, the allowed error and the variability inside the population, we have to consider that the variances differ between the strata. These different variances must be weighted when calculating the necessary sample size.
  
 
{{info
 
{{info
|message=Note:
+
|message=Note:
|text=The "required" sample size is the estimated number of samples needed to meet a specified error at a specified statistical confidence level. The confidence level follows from the error probability alpha, to which a ''t''-value from Student's t-distribution corresponds. In forest inventories the specified allowable error A is often given as 10%.
+
|text=The "necessary" sample size is the estimated number of samples we need to derive an estimation with a defined [[confidence interval]]. This interval is determined by the predetermined error probability <math>\alpha\,</math>, the corresponding t-value from the t-distribution and the allowed error we define for the inventory. Typically this error is 10% for standard forest inventories.
 
}}
 
}}
  
Compare the following formula with that for [[Lecturenotes:Simple random sampling/de|simple random sampling]]:
+
Compare the following formula with the formula provided for [[Sample size|simple random sampling]]:
  
 +
:<math>n = \frac {t^2 \sum \frac {N^2_h S^2_h}{c_h}}{N^2 A^2}\,</math>
  
:<math>n = \frac {t^2 \sum \frac {N^2_h S^2_h}{w_h}}{N^2 A^2}\,</math>,
+
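where <math>c_h = n_h/n\,</math> is the share of the samples that is allocated to stratum <math>h\,</math> (not to be confused with the stratum weight <math>N_h/N\,</math> used in the estimators above).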
  
 +
{{info
 +
|message=Note:
 +
|text=To calculate the overall sample size we obviously need to know the share of samples in the different strata?! That sounds strange, because the sample size is what we want to calculate here. However, since the sample size in each stratum influences the expected error, it is clear that we need this information. That is why we have to define the allocation scheme before we start.
 +
}}
  
where <math>w_h = n_h/N,</math>, i.e. the share of the sample size that falls in stratum <math>h</math>.
+
In all sample size calculations one has to know ''beforehand'' how to assign the total number of samples to the different strata: the set of <math>c_h\,</math> must be predetermined!
  
{{info
+
In small populations (or large samples) we have to consider the finite population correction and the above formula becomes:
|message=Note:
+
|text=To calculate the total sample size it is necessary to know beforehand how large the share, or the number, of samples in the individual strata is?! This sounds illogical at first, since the number of required samples is exactly what we want to calculate. But considering that the point is to include the expected error in each stratum, it is logical that we need a specification for the number of samples.
+
}}
+
 
+
For this, the allocation of samples to the individual strata must be defined beforehand.
+
  
==Allocation of samples to strata==
+
:<math>n=\frac {\sum_{h=1}^L \frac {N_h^2 S_h^2}{c_h}}{\frac {N^2 A^2}{t^2}+ \sum_{h=1}^L N_h S_h^2}\,</math>
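Both sample size formulas can be sketched in Python as shown below. The function name and all input values are assumptions for illustration. In particular, the allowed error <math>A</math> is assumed here to be expressed in the units of the target variable (for example 10% of an expected mean), and <math>c_h = n_h/n</math> is the predetermined share of samples per stratum.

```python
from math import ceil


def required_sample_size(N_h, s2_h, c_h, t, A, use_fpc=False):
    """Required total sample size n for stratified random sampling.

    c_h are the predetermined sample shares n_h / n (they must sum to 1),
    t is the t-value for the chosen error probability and A the allowed
    error in the units of the target variable.
    """
    N = sum(N_h)
    numerator = sum(Nh ** 2 * s2 / ch for Nh, s2, ch in zip(N_h, s2_h, c_h))
    denominator = N ** 2 * A ** 2 / t ** 2
    if use_fpc:
        # additional term from the finite population correction
        denominator += sum(Nh * s2 for Nh, s2 in zip(N_h, s2_h))
    return numerator / denominator


# Hypothetical two-stratum example (values invented for illustration).
N_h = [600, 400]
s2_h = [900.0, 400.0]
c_h = [0.6, 0.4]  # predetermined shares, here proportional to stratum size
t, A = 2.0, 10.0

n_infinite = required_sample_size(N_h, s2_h, c_h, t, A)
n_finite = required_sample_size(N_h, s2_h, c_h, t, A, use_fpc=True)

# In practice the result is rounded up to the next whole sample.
print(ceil(n_infinite), ceil(n_finite))
```

Since the t-value itself depends on the degrees of freedom and therefore on the sample size, the calculation is usually repeated with an updated t-value until the result stabilizes.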
  
Various criteria can be used when distributing the total sample size among the individual strata. These are
+
==Allocation of sample size to the strata==
 +
Three factors are relevant in respect to the decision of how to allocate the samples to the strata:
  
*The size of a stratum (the larger, the more samples)
+
*Stratum sizes (The bigger the stratum the more samples)
*The variability within a stratum (the higher, the more samples)
+
*Variability inside the strata (The more variability the more samples)
*The inventory costs, which may vary between strata (the higher, the fewer samples).
+
*Inventory costs that might vary between strata (The more costly the fewer samples).
  
In the case that all strata are of equal size (equal area shares) and the variability within the strata is equally high,
+
In case that all strata are of same size (equal area) and the variability is also equal, we can use
  
:<math>n_h = \frac {n}{L}\,</math>,
+
:<math>n_h = \frac {n}{L}\,</math>
  
that is, an equal distribution of samples among the individual strata, can be used. As mentioned above, however, stratification would bring no statistical advantage over unstratified sampling here.
+
that is, a '''uniform allocation''' of samples to the strata. As mentioned above, this situation is not realistic; moreover, stratification would not be superior to unstratified sampling in this case.
  
If the number of samples is to be determined proportional to the size of the sub-populations (e.g. the area), then:
+
If the number of samples per stratum should be determined proportional to the stratum size (e.g. area) we have a '''proportional allocation''' with:
  
:<math>n_h = n \frac {N_h}{N}\,</math>.
+
:<math>n_h = n \frac {N_h}{N}\,</math>
  
This distribution of samples is also called '''proportional allocation'''. However, the variability within the strata is not taken into account. If it is to be considered, prior information about the individual strata is necessary. Information on the variance could, for example, be available from a pilot study. In this case the so-called '''Neyman''' or '''optimal allocation''' can be used:
+
In this case one would ignore possibly different variances inside the strata. If we like to consider the variances, we need some prior information about the conditions inside the strata that is sometimes available from case studies or forest management data. In this case one can apply the '''Neyman allocation''':
  
:<math>n_h = n \frac {N_h S_h}{\sum_{i=1}^L N_i S_i}\,</math>.
+
:<math>n_h = n \frac {N_h S_h}{\sum_{h=1}^L N_h S_h}\,</math>
  
If inventory costs differ (e.g. due to terrain conditions or stand density) and cost minimization is an objective of the study, the costs in the individual strata (<math>c_h\,</math>, not to be confused with the weighting factors mentioned above!) can be included. This yields the '''optimal allocation with cost minimization''':
+
If inventory costs vary significantly between strata, this information can be included as an additional factor. In this case one can calculate the '''optimal allocation with cost-minimization''':
  
:<math>n_h = n \frac {\frac {N_h S_h}{\sqrt {c_h}}}{\sum_{i=1}^L \frac{N_i S_i}{\sqrt {c_i}}}\,</math>
+
:<math>n_h = n \frac {\frac {N_h S_h}{\sqrt {k_h}}}{\sum_{h=1}^L \frac{N_h S_h}{\sqrt {k_h}}}\,</math>
  
 
{{info
 
{{info
|message=Note:
+
|message=Note:
|text=Here it becomes clear that <math>n</math> is needed to calculate the allocation. At the same time, <math>n_h</math>, i.e. the result of this calculation, is needed to derive the total sample size. This dilemma can only be resolved by an iterative procedure in which relative shares for the strata are specified first (e.g. based on area) in order to calculate <math>n</math> in the next step.
+
|text=We need <math>n</math> to calculate the allocation. At the same time we need <math>n_h</math> (the result of this calculation) to determine the sample size. This dilemma can be resolved by an iterative process in which we first fix the relative shares (<math>c_h</math>) and then derive <math>n</math>.
 
}}
 
}}
  
==Practical implementation==
+
==Summarizing==
  
Depending on which allocation method is to be used, the following information is needed for the stratification:
+
Depending on the allocation scheme to be used, one needs the following information to implement stratified random sampling:
*number of strata,
+
*number of strata,
*size or relative share of the strata in the population,
+
*size of the strata or their relative share of the total population,
*estimates of the variance in the individual strata,
+
*estimates of the variance inside the strata,
*prior information about the expected survey costs (e.g. time required) in the strata.
+
*where applicable, information about the expected inventory costs in the different strata.
  
Furthermore, as in any inventory, the precision (A) for the overall mean must be specified. The error probability is generally set to <math>\alpha = 0.05\,</math>.
+
Furthermore, one has to predefine the target precision (A), that is, the allowed error expressed as half the width of the confidence interval, the error probability <math>\alpha</math> (typically 0.05) and the corresponding t-value from the t-distribution (for sample sizes above 30 this value is approximately 2).
  
Based on the available information,
+
Based on this information one may
*an appropriate allocation method can be chosen,
+
*choose an appropriate allocation scheme,
*the weights <math>w_h</math> for the individual strata can be calculated,
+
*derive the weights <math>c_h\,</math> for the strata,
*the total sample size can be derived, and
+
*calculate the sample size,
*the number of samples for each stratum can be determined.
+
*and allocate the samples to the strata.
  
==Comments==
+
{{Exercise
 +
|message=Stratified sampling examples
 +
|alttext=Example
 +
|text=Example for Stratified sampling
 +
}}
  
As already mentioned, dividing a population into individual strata is particularly useful if it yields more homogeneous sub-populations, i.e. if the variability within the strata is lower than in the whole population and the differences between the strata are as large as possible. The ratio between these variances naturally also depends on the number of strata: the more strata are formed, the smaller the difference between the strata will be. Experience shows that forming more than 6 strata is not useful, because the method then loses effectiveness.
+
==Comments==
  
Prior information is absolutely necessary in order to carry out a stratification. It can partly be derived from forest management data or from remote sensing information. If the boundaries are obvious, the size of different stand types can, for example, be obtained by delineation on aerial photographs.
+
Stratification is a powerful procedure to reduce the error variance if the above-mentioned preconditions (homogeneous strata with significant differences between the strata means) are fulfilled. In this case one takes a maximum of variability out of the sample.
The great advantage of this method is certainly that the individual strata can be treated independently. For example, completely different [[Lecturenotes:Inventory design/de|inventory design]]s as well as [[Lecturenotes:Plot design/de|plot design]]s can be used. Each can be optimized independently for the specific conditions.
+
Some a priori information is necessary, as the subdivision of the population must be defined before the samples are taken. If this information is missing, one may apply techniques like [[Double_sampling#Double_sampling_for_stratification_.28DSS.29|double sampling for stratification]], which employs stratification without requiring a priori delineation of strata (the strata sizes are estimated in the course of a two-phase sampling process).
 +
In this chapter stratified random sampling was introduced. However, stratification can also be combined with other sampling techniques. It is important to note that one may apply whichever sampling technique is appropriate inside the different strata. This means, for example, that one can optimize the [[Sampling design and plot design|Inventory design]] as well as the [[Sampling design and plot design|Plot design]] according to the conditions inside the strata.
  
 
{{info
 
{{info
|message=Example:
+
|message=Example:
|text=A forest area consists of delineated age classes whose areas are used for stratification. It is now possible to use smaller sample circles in young and dense stands than in the older stands of another stratum. The sampling density can likewise be adapted to the variability.
+
|text=A forest area can be subdivided into stands of different species or age classes. It is then possible to apply different plot designs (e.g. different sizes of sample plots) in the respective strata. While we may need larger plots for the old stands, smaller plots might be sufficient in younger and denser stands.
 
}}
 
}}
  
If the area (or another stratification criterion) is not known beforehand, the information can also be estimated from a sample. This procedure is then called "double sampling for stratification".
+
==References==
 
+
==Literature==
+
 
<references/>
 
<references/>
  
{{FSWr}}
+
{{SEO
[[Category:Forest Inventory lecturenotes]]
+
|keywords=stratified random sampling,strata,population,sampling technique,sub-population
 +
|descrip=Stratified sampling is a method to subdivide a population into separate and more homogeneous sub-populations called strata.
 +
}}
 +
 
 +
[[Category:Sampling design]]
 +
[[Category:Article of the month]]

Latest revision as of 11:07, 10 February 2024

Stratified sampling is actually not a new sampling design of its own, but a procedural method to subdivide a population into separate and more homogeneous sub-populations called strata (Kleinn 2007[1]). In some situations it is useful from a statistical point of view, or required for practical and organizational reasons, to subdivide the population into different strata. The major characteristic is that independent sampling studies are carried out in each stratum, where all strata are considered as sub-populations whose parameters need to be estimated. If simple random sampling is applied, we call this stratified random sampling. The only difference between random sampling and stratified random sampling is that the latter consists of several sampling studies; the only thing we have to consider is how to combine the estimates from the individual strata in order to produce estimates for the total population.

Stratified sampling is efficient especially in those cases where the variability inside the strata is low and the differences of means between the strata are large (Akca 2001[2]). In this case we can achieve a higher precision with the same sample size. Besides statistical issues there are further arguments for stratification.

We can distinguish two general approaches to stratification: the so-called pre-stratification, in which strata are formed before the sampling study starts, and post-stratification, where we generate strata in the course of the sampling or even afterwards based on the data. In the first case, which is described in this article, the strata must be defined and, in the case of geographical strata, delineated to define the sampling frame.

The precondition for a meaningful partitioning of a population into non-overlapping strata is the availability of prior information that can be used as stratification criteria (de Vries 1986[3]). In forest inventories this information might be available in the form of forest management or GIS data, or can be derived from remote sensing data like aerial photos. Most efficient from a statistical point of view is a stratification of the population according to the target value of the inventory. As this target value is typically not known before the inventory, forest variables that are correlated with it are used as stratification criteria. In large managed forest areas, for example, age classes or forest types might be good stratification criteria if the estimation of volume per ha is targeted.



Arguments for stratification

Sometimes it is useful to subdivide the population of interest into a number of sub-populations and carry out independent sampling in each of these strata. There are statistical as well as practical considerations that make this technique very favorable and interesting for large-area forest inventories. Not without reason, almost all national forest inventories are based on stratification.

Statistical justifications
  • The spatial distribution of sample points inside the population is more even if these points are selected within single strata,
  • It is possible to make an individual optimization of sampling and plot design for each stratum,
  • One usually increases the precision of the estimations for the total population,
  • Separate estimations for each of the strata are produced in a pre-planned manner,
  • It is guaranteed that there are actually sufficient observations in each one of the strata,
  • It is possible to produce estimations with defined precision level for sub-populations.
Practical justification
  • The possibility to optimize the inventory design separately for each stratum is very efficient and helps to minimize costs,
  • To facilitate inventory work (particularly field work): independent field campaigns can be carried out in each stratum,
  • It allows a better specialization of field crews (e.g. botanists).

Stratification criteria

Very different stratification criteria are conceivable for partitioning a population. If the reason for stratification is not an improvement of the accuracy of the estimates, the stratification variables need not be correlated with the target value. Under special conditions it might be useful to stratify even if there is no statistical justification. For example, a political boundary dividing a forest area might be a good reason for stratification, even if the forest is very homogeneous, if separate estimates are to be derived for both parts afterwards. Other examples of meaningful criteria are:

Geographical stratification
  • Eco zones,
  • Forest types,
  • Site and soil types,
  • Topographical conditions,
  • Political boundaries or properties,
  • ...

Furthermore, the expected inventory costs (in terms of time consumption) can be used as a criterion. These costs are typically correlated with the above-mentioned geographical conditions. It might, for example, be reasonable to stratify a forest area by slope classes if the time consumption for field work differs significantly between flat terrain and steep slopes.

Subject matter stratification
  • Species,
  • Species groups (e.g. commercial / non-commercial),
  • Tree sociological classes,
  • Age classes in plantation forests,
  • ...

Statistics

The only new concept that needs to be introduced in stratified random sampling is how to combine the estimates derived for different strata. The estimators for stratified random sampling are based on simple considerations about linear combinations (Kleinn 2007[1]). When we have two independent random variables \(Y_1\,\) and \(Y_2\,\) and if we are interested in the sum of the two \(Y_1+Y_2\,\), then

\[E(Y_1+Y_2)=E(Y_1)+E(Y_2)\,\] and

\[var(Y_1+Y_2)=var(Y_1)+var(Y_2)\,\]


Simple:
The expected value for the sum of both is equal to the sum of both single expected values. It is intuitively clear that we can sum up the totals to derive an overall total. It is different if we consider the variables to be means! In this case we have to weight the single means according to the size of the strata.

If we consider \(Y_1\,\) and \(Y_2\,\) as estimations from the two strata 1 and 2, then we can apply the principles derived from these considerations for stratified sampling.

However, to calculate the overall mean from two estimations, we have to weight the single means in order to account for possibly different sizes of the sub-populations \(N_1\,\) and \(N_2\,\). If both strata are of the same size, we can calculate the mean by:

\[\frac 12 (Y_1+Y_2)=\frac 12 Y_1+\frac 12Y_2=c_1Y_1+c_2Y_2\,\]

The factor \(c_i\,\) can be interpreted as a weight for the single estimations from stratum 1 and 2. Because both are of equal size in this example \(c_1=c_2\,\) holds. A more typical case would be that we deal with strata of unequal sizes.


Example:
Weighting of partial results (or estimates) is important if they stem from sub-populations of different sizes and we want to calculate a mean. A simple example: you want to calculate the mean body weight of 50 students in a classroom. You have a mean value for the 15 women (55 kg) and a mean body weight for the 35 men (73 kg). The unweighted mean (64 kg) would be wrong. The correct value is 15/50*55 + 35/50*73 = 67.6 kg. The weights 15/50 and 35/50 express the share of the respective group in the total population. This weight equals the probability that a student selected by simple random sampling belongs to the respective group.
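
The classroom example can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions chosen for readability and do not come from the lecture notes.

```python
# Group sizes and group means taken from the classroom example above.
group_sizes = [15, 35]
group_means_kg = [55.0, 73.0]

total = sum(group_sizes)
# Each weight is the share of the group in the total population.
weights = [size / total for size in group_sizes]

# Unweighted mean of the two group means (ignores group sizes).
unweighted_mean = sum(group_means_kg) / len(group_means_kg)
# Weighted mean: every group mean is multiplied by its share.
weighted_mean = sum(w * m for w, m in zip(weights, group_means_kg))

print(f"unweighted mean: {unweighted_mean:.1f} kg")
print(f"weighted mean:   {weighted_mean:.1f} kg")
```

The weights sum to 1, which is the condition the stratum weights have to fulfil as well.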


The weights must be proportional to the size of the sub-populations. In case of forest inventories the population is typically the total forest area (we are selecting sample points from this continuum), so that the individual stratum areas can be used to derive weighting factors. The sum of all weights must be 1, so that:

\[\sum c_i=1\,\]

The expected value E for different sized strata is then:

\[E(c_1Y_1+c_2Y_2)=E(c_1Y_1)+E(c_2Y_2)=c_1E(Y_1)+c_2E(Y_2)\,\] , where \(c_1 \not= c_2\,\) or

\[E(\sum c_iY_i)=\sum c_iE(Y_i)\,\]

and the variance:

\[var(c_1Y_1+c_2Y_2)=var(c_1Y_1)+var(c_2Y_2)=c_1^2var(Y_1)+c_2^2var(Y_2)\,\] or

\[var(\sum c_iY_i)=\sum c_i^2var(Y_i)\,\]


Note:
If we expand (or like in this case relate) the variance with a constant factor, this factor must be squared, because variance is a squared measure!

Notation

\(L\,\) Number of strata \(h=1, ... , L \,\)
\(N\,\) Total population size
\(N_h\,\) Size of stratum \(h (N=\sum N_h)\,\)
\(\bar y\,\) Estimated population mean
\(\bar y_h\,\) Estimated mean of stratum \(h\,\)
\(n\,\) Total sample size
\(n_h\,\) Sample size in stratum \(h\,\)
\(S^2_h\,\) Sample variance in stratum \(h\,\)
\(\tau\,\) Total
\(\tau_h\,\) Total in stratum \(h\,\)
\(\hat \tau_h\,\) Estimated total in stratum \(h\,\)
\(c_h\,\) Relative share of stratum \(h\,\) or weight of stratum
\(\hat {var} (\bar y)\,\) Estimated error variance
\(\hat {var} (\hat \tau)\,\) Estimated error variance of the total
\(k_h\,\) Estimated costs for the assessment in stratum \(h\,\)

Estimator for the mean

The estimator for the mean in stratified random sampling is derived from the considerations above and is analogous to the estimator of simple random sampling:

\[\bar y = \sum_{h=1}^L \frac{N_h}{N} \bar y_h = \frac {1}{N} \sum_{h=1}^L N_h \bar y_h\,\]
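
As a minimal sketch of this estimator in Python, the function below computes the weighted mean from stratum sizes and stratum means. The function name `stratified_mean` and the three-stratum example values are illustrative assumptions, not data from the lecture notes.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Return sum(N_h * ybar_h) / N for the given strata."""
    if len(stratum_sizes) != len(stratum_means):
        raise ValueError("one mean per stratum is required")
    total_size = sum(stratum_sizes)
    weighted_sum = sum(N_h * y_h for N_h, y_h in zip(stratum_sizes, stratum_means))
    return weighted_sum / total_size


# Hypothetical example: three strata with sizes N_h and estimated means ybar_h.
N_h = [400, 300, 300]
ybar_h = [250.0, 180.0, 120.0]

y_bar = stratified_mean(N_h, ybar_h)
print(f"estimated population mean: {y_bar:.1f}")
```

Note that the sample sizes per stratum do not enter the mean; they only affect its error variance.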

Estimator for the error variance

The estimator for the error variance (selection without replacement) is:

\[\hat {var} (\bar y) = \sum_{h=1}^L \left\lbrace \left( \frac {N_h}{N} \right)^2 \hat {var} (\bar y_h) \right\rbrace = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {N_h-n_h}{N_h} \frac {S^2_h}{n_h}\]

In this case \((N_h-n_h)/N_h\,\) is a finite population correction that is necessary if the strata are small and/or the sample size is large. It is typically applied if the ratio of sample size to population size (the sampling fraction) is larger than 0.05 (Akca 2001[2]).


Note:
A finite population correction is important if we select without replacement and the population size is noticeably reduced by drawing the samples. As a consequence, the selection probabilities change with every sample we draw (because the remaining population decreases), which is corrected by this factor.


This estimator looks complex, but can be read as the weighted linear combination of the simple random sampling estimators applied to the strata. The weighting factor for stratum \(h\,\) is in this case

\[c_h^2=\frac {N_h^2}{N^2}\,\]

Without finite population correction the error variance is:

\[\hat {var} (\bar y) = \frac{1}{N^2} \sum_{h=1}^L N^2_h \frac {S^2_h}{n_h}\]
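
The two variants of the error variance can be sketched in Python as follows. The function name, the `fpc` switch and the example values are assumptions made for illustration only.

```python
def stratified_error_variance(N_h, n_h, s2_h, fpc=True):
    """Estimated error variance of the stratified mean.

    N_h  : stratum sizes
    n_h  : sample sizes per stratum
    s2_h : sample variances per stratum
    fpc  : apply the finite population correction (N_h - n_h) / N_h
    """
    N = sum(N_h)
    total = 0.0
    for N_i, n_i, s2_i in zip(N_h, n_h, s2_h):
        correction = (N_i - n_i) / N_i if fpc else 1.0
        total += N_i**2 * correction * s2_i / n_i
    return total / N**2


# Hypothetical example with three strata.
N_h = [400, 300, 300]
n_h = [20, 15, 15]
s2_h = [900.0, 400.0, 225.0]

var_with_fpc = stratified_error_variance(N_h, n_h, s2_h, fpc=True)
var_without_fpc = stratified_error_variance(N_h, n_h, s2_h, fpc=False)
print(f"with fpc:    {var_with_fpc:.4f}")
print(f"without fpc: {var_without_fpc:.4f}")
```

In this example every stratum has a sampling fraction of 5 %, so the correction reduces the variance by exactly that share.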

Estimator of the total

\[\hat\tau = N\bar y = \sum_{h=1}^L \hat \tau_h = \sum_{h=1}^L N_h \bar y_h\,\]

The estimated error variance of the estimated population total follows then with:

\[\hat{var}(\hat {\tau}) = \hat{var}(N \bar y) = N^2 \hat{var}(\bar y)\]
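
A short numerical sketch of these two relations, using hypothetical values for the population size, the estimated mean and its error variance:

```python
import math

# Hypothetical inputs: total population size, estimated mean and the
# estimated error variance of that mean.
N = 1000
y_bar = 190.0
var_y_bar = 10.4025

tau_hat = N * y_bar                    # estimated population total
var_tau_hat = N**2 * var_y_bar         # variance scales with N squared
se_tau_hat = math.sqrt(var_tau_hat)    # standard error of the total

print(f"estimated total: {tau_hat:.0f}")
print(f"standard error:  {se_tau_hat:.1f}")
```

Because both the total and its standard error are scaled by the same factor, the relative standard error of the total equals that of the mean.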

To calculate the confidence interval for the estimate we need information about the standard error and the sample size. In stratified random sampling one faces a difficulty with the number of degrees of freedom. While it is a direct function of sample size (\(DF=n-1\,\)) in simple random sampling, we deal with \(L\,\) different sample sizes \(n_h\,\) that are combined. If the variances of the strata differ significantly, it is not possible to simply pool the degrees of freedom of the different strata. This statistical problem is known as the Behrens-Fisher problem. Welch and Satterthwaite proposed an approximation formula that can be used to calculate the "effective number of degrees of freedom", so that the t-statistic can be applied approximately correctly.

\[DF_e=\frac {\left(\sum_{h=1}^L g_h S_h^2 \right)^2}{\sum_{h=1}^L \frac {g_h^2 S_h^4}{n_h-1}}\,\]

where \(g_h = N_h (N_h - n_h)/n_h\,\).
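
The approximation can be sketched in Python as below. The sketch assumes the common textbook definition \(g_h = N_h (N_h - n_h)/n_h\) for the factor \(g_h\); the function name and the example values are hypothetical.

```python
def effective_degrees_of_freedom(N_h, n_h, s2_h):
    """Satterthwaite-type approximation of the effective degrees of freedom.

    Assumes g_h = N_h * (N_h - n_h) / n_h (an assumption of this sketch).
    """
    g_h = [N * (N - n) / n for N, n in zip(N_h, n_h)]
    numerator = sum(g * s2 for g, s2 in zip(g_h, s2_h)) ** 2
    denominator = sum(
        (g * s2) ** 2 / (n - 1) for g, s2, n in zip(g_h, s2_h, n_h)
    )
    return numerator / denominator


# Hypothetical example with three strata.
N_h = [400, 300, 300]
n_h = [20, 15, 15]
s2_h = [900.0, 400.0, 225.0]

df_e = effective_degrees_of_freedom(N_h, n_h, s2_h)
print(f"effective degrees of freedom: {df_e:.2f}")
```

The result always lies between the smallest value of \(n_h - 1\) and the pooled value \(\sum (n_h - 1)\); the more the stratum variances differ, the further it drops below the pooled value.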

Sample size

To determine the necessary sample size that is always dependent on the error probability, the allowed error and the variability inside the population, we have to consider that the variance inside the different strata is different. These different variances must be weighted to calculate the necessary sample size.


Note:
The "necessary" sample size is the estimated number of samples we need to derive an estimation with a defined confidence interval. This interval is determined by the predetermined error probability \(\alpha\,\), the corresponding t-value from the t-distribution and the allowed error we define for the inventory. Typically this error is 10% for standard forest inventories.

Compare the following formula with the formula provided for simple random sampling:

\[n = \frac {t^2 \sum \frac {N^2_h S^2_h}{c_h}}{N^2 A^2}\,\]

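where \(c_h = n_h/n\,\) is the share of the samples that falls in stratum \(h\,\). Note that \(c_h\,\) denotes the sample share here, not the stratum weight \(N_h/N\,\); the two coincide only under proportional allocation.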


Note:
To calculate the overall sample size we need to know the share of samples in the different strata. That sounds strange, because the sample size is what we want to calculate here. However, since the sample size in each stratum influences the expected error, it is clear that we need this information. That is why we have to define the allocation scheme before we start.

In all sample size calculations one has to know before how to assign the total number of samples to the different strata: the set of \(c_h\,\) must be predetermined!

In small populations (or large samples) we have to consider the finite population correction and the above formula becomes:

\[n=\frac {\sum_{h=1}^L \frac {N_h^2 S_h^2}{c_h}}{\frac {N^2 A^2}{t^2}+ \sum_{h=1}^L N_h S_h^2}\,\]
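
Both sample size formulas can be sketched in one Python function. The function name, the `fpc` switch and the example values are illustrative assumptions; the allowed error `A` is taken to be an absolute value in the unit of the mean, and the shares `c_h` are assumed to be fixed beforehand (here proportional to stratum size).

```python
import math


def required_sample_size(N_h, s2_h, c_h, A, t=2.0, fpc=False):
    """Total sample size n for given sample shares c_h = n_h / n.

    N_h  : stratum sizes
    s2_h : prior variances per stratum
    c_h  : predetermined shares of the sample per stratum (sum to 1)
    A    : allowed error (half width of the confidence interval)
    t    : t-value for the chosen error probability
    fpc  : use the formula with finite population correction
    """
    N = sum(N_h)
    weighted = sum(N_i**2 * s2_i / c_i for N_i, s2_i, c_i in zip(N_h, s2_h, c_h))
    if fpc:
        correction = sum(N_i * s2_i for N_i, s2_i in zip(N_h, s2_h))
        n = weighted / (N**2 * A**2 / t**2 + correction)
    else:
        n = t**2 * weighted / (N**2 * A**2)
    return math.ceil(n)  # round up to whole samples


# Hypothetical example: proportional shares, allowed error of 5 units.
N_h = [400, 300, 300]
s2_h = [900.0, 400.0, 225.0]
c_h = [0.4, 0.3, 0.3]

n_plain = required_sample_size(N_h, s2_h, c_h, A=5.0)
n_fpc = required_sample_size(N_h, s2_h, c_h, A=5.0, fpc=True)
print(n_plain, n_fpc)
```

With the finite population correction fewer samples are needed, because part of the population is already known once the sample is drawn.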

Allocation of sample size to the strata

Three factors are relevant in respect to the decision of how to allocate the samples to the strata:

  • Stratum sizes (The bigger the stratum the more samples)
  • Variability inside the strata (The more variability the more samples)
  • Inventory costs that might vary between strata (The more costly the fewer samples).

If all strata are of the same size (equal area) and the variability is also equal, we can use

\[n_h = \frac {n}{L}\,\]

that is, a uniform allocation of samples to the strata. As mentioned above, this situation is not realistic, and stratification would not be superior to unstratified sampling in this case.

If the number of samples per stratum should be determined proportional to the stratum size (e.g. area) we have a proportional allocation with:

\[n_h = n \frac {N_h}{N}\,\]

This allocation ignores possibly different variances inside the strata. To take the variability into account, we need some prior information about the conditions inside the strata, which is sometimes available from case studies or forest management data. In this case one can apply the Neyman allocation, which uses the standard deviation \(S_h = \sqrt{S^2_h}\,\) of each stratum:

\[n_h = n \frac {N_h S_h}{\sum_{h=1}^L N_h S_h}\,\]

If inventory costs vary significantly between strata, this information can be included as an additional factor. In this case one can calculate the optimal allocation with cost-minimization, where \(k_h\,\) are the estimated costs of the assessment in stratum \(h\,\) and \(S_h = \sqrt{S^2_h}\,\) is the standard deviation in stratum \(h\,\):

\[n_h = n \frac {\frac {N_h S_h}{\sqrt {k_h}}}{\sum_{h=1}^L \frac{N_h S_h}{\sqrt {k_h}}}\,\]
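
The allocation schemes can be compared with a small Python sketch. Neyman allocation and the cost-optimal allocation are commonly stated in terms of the stratum standard deviation \(S_h\) (the square root of \(S_h^2\)), and the sketch follows that form. The function name `allocate` and all example values are hypothetical.

```python
import math


def allocate(n, N_h, s_h=None, k_h=None):
    """Split n among strata proportional to N_h * s_h / sqrt(k_h).

    s_h omitted         -> proportional allocation
    s_h given           -> Neyman allocation
    s_h and k_h given   -> optimal allocation with cost minimization
    Returns unrounded sample sizes per stratum.
    """
    L = len(N_h)
    s_h = [1.0] * L if s_h is None else s_h
    k_h = [1.0] * L if k_h is None else k_h
    scores = [N * s / math.sqrt(k) for N, s, k in zip(N_h, s_h, k_h)]
    total = sum(scores)
    return [n * score / total for score in scores]


# Hypothetical prior information for three strata.
n = 90
N_h = [400, 300, 300]
s_h = [30.0, 20.0, 15.0]   # prior standard deviations
k_h = [1.0, 4.0, 1.0]      # relative cost per sample

uniform = [n / len(N_h)] * len(N_h)
proportional = allocate(n, N_h)
neyman = allocate(n, N_h, s_h)
cost_optimal = allocate(n, N_h, s_h, k_h)

for name, result in [("uniform", uniform), ("proportional", proportional),
                     ("Neyman", neyman), ("cost-optimal", cost_optimal)]:
    print(name, [round(x, 1) for x in result])
```

The results are left unrounded on purpose; in practice one would round them (usually up) and make sure every stratum receives at least two samples so that a variance can be estimated.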


Note:
We need \(n\) to calculate the allocation. At the same time we need \(n_h\) (the result of this calculation) to determine the sample size. This dilemma can be resolved by an iterative process in which we first fix the relative shares (\(c_h\)) and then derive \(n\).

Summarizing

Depending on the allocation scheme to be used, one needs the following information to implement stratified random sampling:

  • number of strata,
  • size of the strata or their relative share of the total population,
  • estimates of the variance inside the strata,
  • where applicable, information about the expected inventory costs in the different strata.

Furthermore, one has to predefine the target precision (A), that is, the allowed error expressed as half the width of the confidence interval, the error probability \(\alpha\) (typically 0.05) and the corresponding t-value from the t-distribution (for sample sizes above 30 this value is approximately 2).

Based on this information one may

  • choose an appropriate allocation scheme,
  • derive the weights \(c_h\,\) for the strata,
  • calculate the sample size,
  • and allocate the samples to the strata.
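
The four steps can be strung together in a short Python sketch. It assumes Neyman allocation (based on prior standard deviations), no finite population correction and an absolute allowed error; all numbers and variable names are hypothetical.

```python
import math

# Hypothetical prior information for three strata.
N_h = [400, 300, 300]        # stratum sizes
s_h = [30.0, 20.0, 15.0]     # prior standard deviations per stratum
A = 6.0                      # allowed error, same unit as the mean
t = 2.0                      # approximate t-value for alpha = 0.05

# Steps 1 and 2: choose an allocation scheme (here: Neyman) and derive
# the shares c_h = n_h / n from it.
scores = [N_i * s_i for N_i, s_i in zip(N_h, s_h)]
shares = [score / sum(scores) for score in scores]

# Step 3: total sample size for these shares (without fpc).
N = sum(N_h)
weighted = sum(
    N_i**2 * s_i**2 / c_i for N_i, s_i, c_i in zip(N_h, s_h, shares)
)
n = math.ceil(t**2 * weighted / (N**2 * A**2))

# Step 4: allocate the samples to the strata, rounding up.
n_h = [math.ceil(n * c_i) for c_i in shares]

print("total sample size:", n)
print("samples per stratum:", n_h)
```

Rounding up per stratum can make the allocated samples add up to slightly more than `n`; this errs on the side of meeting the target precision.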


Stratified sampling examples: Example for Stratified sampling

Comments

Stratification is a powerful procedure to reduce the error variance if the above-mentioned preconditions (homogeneous strata with significant differences between the strata means) are fulfilled. In this case one takes a maximum of variability out of the sample. Some a priori information is necessary, as the subdivision of the population must be defined before the samples are taken. If this information is missing, one may apply techniques like double sampling for stratification, which employs stratification without requiring a priori delineation of strata (the strata sizes are estimated in the course of a two-phase sampling process). In this chapter stratified random sampling was introduced. However, stratification can also be combined with other sampling techniques. It is important to note that one may apply whichever sampling technique is appropriate inside the different strata. This means, for example, that one can optimize the inventory design as well as the plot design according to the conditions inside the strata.


Example:
A forest area can be subdivided into stands of different species or age classes. It is then possible to apply different plot designs (e.g. different sizes of sample plots) in the respective strata. While we may need larger plots for the old stands, smaller plots might be sufficient in younger and denser stands.

References

  1. Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing. Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 p.
  2. Akca, A. 2001. Waldinventur. J.D. Sauerländer's Verlag. Frankfurt am Main. 193 p.
  3. de Vries, P.G. 1986. Sampling Theory for Forest Inventory. A Teach-Yourself Course. Springer. 399 p.
