Double sampling

Latest revision as of 08:05, 6 March 2014

The ratio estimator and the regression estimator both require that the parametric mean or the true total of the ancillary variable be known. This is often a problem, because these population values may not be available. A way out is to estimate them as well. This is exactly what double sampling, also referred to as two-phase sampling, is about: in a first phase, the mean or total of the ancillary variable is estimated, usually from a relatively large sample, since this variable is relatively easy and inexpensive to observe. Then, in a second phase, a smaller sample is taken on which the target variable is observed; this variable is frequently much more expensive or difficult to observe. The ancillary variable is observed on the same second-phase samples, so that a relationship between target and ancillary variable can be established (a ratio in the case of double sampling with the ratio estimator, a regression in the case of double sampling with the regression estimator). The correlation between the two variables is what allows the second-phase sample size to be reduced ^[1].
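The two-phase idea can be sketched in a few lines of Python for the ratio case. This is only an illustration under simple assumptions: the function name and the data are hypothetical, and the mean of the target variable is estimated in the usual ratio form, as the second-phase ratio r = ȳ / x̄ multiplied by the first-phase mean x̄' of the ancillary variable.

```python
from statistics import mean


def double_sampling_ratio_mean(x_first_phase, xy_second_phase):
    """Estimate the mean of Y as r * x_bar_first (hypothetical helper).

    x_first_phase   -- ancillary values from the large first-phase sample
    xy_second_phase -- (x, y) pairs observed on the small second-phase sample
    """
    x_bar_first = mean(x_first_phase)                    # first-phase mean of X
    x_bar_second = mean(x for x, _ in xy_second_phase)   # second-phase mean of X
    y_bar_second = mean(y for _, y in xy_second_phase)   # second-phase mean of Y
    r = y_bar_second / x_bar_second                      # estimated ratio
    return r * x_bar_first


# Made-up numbers, for illustration only
x_first = [10.0, 12.0, 8.0, 11.0, 13.0, 12.0]
pairs = [(10.0, 20.0), (8.0, 17.0), (12.0, 23.0)]
estimate = double_sampling_ratio_mean(x_first, pairs)
```

The cheap first-phase sample only contributes the mean of the ancillary variable; the relationship between the two variables comes entirely from the second-phase pairs.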

Double sampling can be carried out with dependent or with independent phases. The phases are dependent when the second-phase sample is a sub-sample of the first-phase sample: a randomly selected subset of the first-phase samples is revisited and the target variable is observed in addition to the ancillary variable. With independent phases, the second-phase sample is selected independently of the first-phase sample; the ancillary variable must then be observed anew on the second-phase samples.
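The difference between dependent and independent phases can be illustrated with a small selection routine. The sketch below assumes simple random sampling without replacement in both phases; the function name, the unit identifiers and the sample sizes are hypothetical.

```python
import random


def draw_two_phases(population_ids, n_first, n_second, dependent=True, seed=None):
    """Select first- and second-phase samples (illustrative sketch)."""
    rng = random.Random(seed)
    first = rng.sample(population_ids, n_first)
    if dependent:
        # Dependent phases: the second phase is a sub-sample of the first phase
        second = rng.sample(first, n_second)
    else:
        # Independent phases: a fresh draw from the whole population
        second = rng.sample(population_ids, n_second)
    return first, second


first, second = draw_two_phases(range(1000), n_first=100, n_second=20,
                                dependent=True, seed=1)
```

With `dependent=True`, every second-phase unit already carries a first-phase observation of the ancillary variable; with `dependent=False`, that observation has to be made again.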

The idea of two-phase sampling as presented in this chapter can also be extended to more than two phases. However, the more phases, the more complex the estimators.

Important: Do not confuse two-phase sampling with two-stage sampling. The latter is a completely different concept that is based on the subdivision of the population into primary and secondary units.

In addition to double sampling with the ratio or regression estimator, there is a third variation of double sampling that is sometimes used in forest inventory: double sampling for stratification.

Double sampling for stratification (DSS)

General remarks

The article on stratified random sampling mentioned that it is sometimes impossible or too difficult to delimit the strata clearly before sampling. In those cases, a so-called post-stratification can be done, or the stratification can be integrated into the sampling process. This is exactly what double sampling for stratification does: in the first phase, a relatively large sample is taken and the only variable observed is the stratum to which each sample belongs, whatever the criteria used for stratification. The first phase therefore serves to estimate the strata sizes. In other words, at each first-phase sample point a categorical variable is observed that can take on L different values, where L is the number of strata to be distinguished. This is the ancillary variable of the first phase.

In the second phase, a stratified sub-sample is taken from the first-phase samples. This is sampling with dependent phases, because the second-phase samples are drawn from the first-phase samples and the value of the ancillary variable guides the second-phase stratified sampling. The target variable is then observed on these second-phase samples, and estimation follows the estimators for stratified sampling, which must now contain further components that account for the estimation error in the strata sizes ^[1].
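The second-phase selection can be sketched as follows. The example assumes that the first phase produced a stratum label for every sampled unit and that a simple random sub-sample is drawn within each stratum; the function name, the labels and the per-stratum sample sizes are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(first_phase_labels, n_per_stratum, seed=None):
    """Draw a stratified sub-sample from the first-phase units (sketch).

    first_phase_labels -- dict: unit id -> stratum label observed in phase 1
    n_per_stratum      -- dict: stratum label -> desired second-phase size
    """
    rng = random.Random(seed)
    units_by_stratum = defaultdict(list)
    for unit_id, stratum in first_phase_labels.items():
        units_by_stratum[stratum].append(unit_id)

    second_phase = {}
    for stratum, units in units_by_stratum.items():
        # Never request more units than the first phase found in this stratum
        size = min(n_per_stratum.get(stratum, 0), len(units))
        second_phase[stratum] = rng.sample(units, size)
    return second_phase


labels = {1: "dense", 2: "open", 3: "dense", 4: "open", 5: "dense", 6: "dense"}
selected = stratified_subsample(labels, {"dense": 2, "open": 1}, seed=0)
```

The target variable would then be observed only on the units listed in `selected`.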

In double sampling for stratification, strata sizes need not be known before sampling starts. In many cases, the number and type of strata are defined beforehand; but even that can be done during the analysis of the first phase.

Example: If, in an open forest, stratification is to be done according to crown cover, one could observe crown cover in the first-phase samples and then decide during the analysis (when the frequency distribution of crown cover values is known) how many strata to distinguish and along which crown cover thresholds.
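Once thresholds have been chosen, assigning first-phase samples to strata is a simple lookup. The sketch below is purely illustrative: the crown cover values (in percent) and the thresholds of 20 and 40 are made up, and a value that falls exactly on a threshold is assigned to the upper stratum by convention.

```python
from bisect import bisect_right


def assign_strata(crown_cover_values, thresholds):
    """Map each crown cover value to a stratum number 1..L (sketch).

    With k thresholds there are L = k + 1 strata; a value equal to a
    threshold goes to the upper stratum.
    """
    cuts = sorted(thresholds)
    return [bisect_right(cuts, value) + 1 for value in crown_cover_values]


crown_cover = [5, 12, 33, 48, 61, 27, 8, 55]   # hypothetical first-phase values
strata = assign_strata(crown_cover, thresholds=[20, 40])
# strata == [1, 1, 2, 3, 3, 2, 1, 3]
```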

Notation

Notation in double sampling for stratification resembles that for stratified random sampling, but must reflect the two-phase design:

\(L\,\)	Number of strata;
\(n'\,\)	Total number of samples in the first phase;
\(n'_{h}\,\)	Number of samples in stratum h in the first phase;
\(w'_{h}\,\)	Weight of stratum h, estimated from the first phase;
\(\bar y_h\)	Estimated mean of target variable Y in stratum h;
\(\bar y\)	Estimated mean of the target variable Y for the entire area of interest;
\(s^2_{h}\)	Estimated variance of the target variable Y within stratum h.

Estimators

The relative size of stratum h, that is, the stratum weight as estimated from the first phase, is

\[w'_h = \frac {n'_h}{n'}\]

and the estimated mean of the target variable Y for the entire area of interest is then

\[\bar y = \sum_{h=1}^L w'_h \bar y_h\]

This estimator corresponds to the estimator in stratified random sampling; the only notable difference is that the strata weights are random variables here: because they are estimated, they carry a sampling error of their own.
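The two estimators above translate directly into code. The following sketch assumes that the first phase is available as a list of stratum labels and the second phase as a list of observed values per stratum; the function name and all numbers are hypothetical.

```python
from collections import Counter
from statistics import mean


def dss_mean(first_phase_strata, second_phase_y):
    """Stratum weights w'_h = n'_h / n' and overall mean sum(w'_h * y_bar_h).

    first_phase_strata -- stratum label of each of the n' first-phase samples
    second_phase_y     -- dict: stratum label -> target values from phase 2
    """
    n_first = len(first_phase_strata)                      # n'
    counts = Counter(first_phase_strata)                   # n'_h per stratum
    weights = {h: counts[h] / n_first for h in counts}     # w'_h
    stratum_means = {h: mean(second_phase_y[h]) for h in counts}
    y_bar = sum(weights[h] * stratum_means[h] for h in counts)
    return weights, stratum_means, y_bar


# Made-up data: 10 first-phase points, 5 second-phase observations
phase1 = ["A"] * 6 + ["B"] * 4
phase2 = {"A": [10.0, 12.0, 14.0], "B": [20.0, 24.0]}
weights, stratum_means, y_bar = dss_mean(phase1, phase2)
```

Here the weights are 0.6 and 0.4, the stratum means are 12 and 22, and the estimated overall mean is about 16.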

The estimated error variance is then

\[\widehat{\mathrm{var}}(\bar y)=\sum_{h=1}^L \left( {w'_h}^2 \cdot \frac{s_h^2}{n'_h} + w'_h \cdot \frac{(\bar y_h - \bar y)^2}{n'} \right)\]

where the finite population correction is neglected, assuming large populations and samples that are small relative to the population size. The first term in parentheses is known from error variance estimation for stratified random sampling. The second term is new and arises because the strata sizes are only estimated; the error variance must be greater when the stratum sizes are estimated rather than known.
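The variance formula can be evaluated term by term, which also shows how much of the total comes from estimating the strata sizes. The sketch below follows the formula as written above and ignores the finite population correction. The function name and the numbers are hypothetical, and the per-stratum count in the denominator of the first term is passed in explicitly as `stratum_counts`, so that the caller decides which count applies (the example uses the n'_h of the formula).

```python
def dss_variance(weights, stratum_means, stratum_variances, stratum_counts, n_first):
    """Estimated error variance of the DSS mean, split into its two parts.

    Returns (within_part, between_part, total). The within part is the
    stratified-sampling term; the between part is the extra term caused
    by estimating the stratum weights from n_first first-phase samples.
    """
    y_bar = sum(weights[h] * stratum_means[h] for h in weights)
    within_part = sum(
        weights[h] ** 2 * stratum_variances[h] / stratum_counts[h] for h in weights
    )
    between_part = sum(
        weights[h] * (stratum_means[h] - y_bar) ** 2 / n_first for h in weights
    )
    return within_part, between_part, within_part + between_part


# Made-up inputs, for illustration only
w = {"A": 0.6, "B": 0.4}
means = {"A": 12.0, "B": 22.0}
variances = {"A": 4.0, "B": 8.0}
counts = {"A": 6, "B": 4}
within, between, total = dss_variance(w, means, variances, counts, n_first=10)
```

In this made-up case the within part is about 0.56 and the between part about 2.40, so most of the estimated error variance (about 2.96) stems from the uncertainty in the strata sizes.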

Double sampling examples: Examples of application

References

[1] Kleinn, C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing, Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164 pp.


@@ Line 1: / Line 1: @@
-{{Content Tree|HEADER=Forest Inventory lecturenotes|NAME=Forest Inventory lecturenotes}}
+{{Ficontent}}
+For the [[ratio estimator]] and the [[regression estimator]] it is stipulated, that the parametric mean or the true total of the [[ancillary variable]] needs to be known, in order to apply those estimators. In some cases, this is a very unpleasant situation, because the [[population]] values might not be known. A way out is to also estimate these values. This is exactly what double sampling is about, also referred to as two-phase sampling: in a first phase, the ancillary variable is estimated, usually with a relatively large sample of a variable that is relatively easy and inexpensive to observe. Then, in a second phase, a smaller sample is taken of the [[target variable]], which is frequently a variable much more expensive or difficult to observe; simultaneously, however, also the ancillary variable is observed, so that a relationship between target and ancillary variable can be established (either a ratio in the case of double sampling with the [[ratio estimator]] or a regression in the case of double sampling with the regression estimator). Here, the correlation to the ancillary variable is also used to reduce the [[sample size]] in the second phase <ref name="kleinn2007">Kleinn, C. 2007. Lecture Notes  for the Teaching Module Forest Inventory. Department of Forest Inventory   and Remote Sensing. Faculty of Forest Science and Forest Ecology,  Georg-August-Universität Göttingen. 164 S.</ref>.
-==Introduction==
-For the ratio estimator and the regression estimator we stipulated, that the parametric mean or the true total of the ancillary variable need to be known, in order to apply those estimators. In some cases, this is a very unpleasant situation, because the population values might not be known. A way out is to also estimate these values. This is exactly what double sampling is about, also referred to as two-phase sampling: in a first phase, the ancillary variable is estimated, usually with a relatively large sample of a variable that is relatively easy and inexpensive to observe. Then, in a second phase, a smaller sample is taken of the target variable, which is frequently a variable much more expensive or difficult to observe; simultaneously, however, also the ancillary variable is observed, so that a relationship between target and ancillary variable can be established (either a ratio in the case of double sampling with the [[Ratio estimator|ratio estimator]] or a regression in the case of double sampling with the regression estimator). Here, the correlation to the ancillary variable is also used to reduce the sample size in the second phase.
 ====Observe:====
+*Here, we deal with double sampling, with [[simple random sampling]] in both phases. The estimators given here are valid only for that [[:category:Sampling design|sampling design]]. If other sampling designs are used, or different designs in the two phases, the corresponding estimators must be searched for or developed.
-<blockquote>
-*Here, we deal with double sampling, with [[Simple random sampling|simple random sampling]] in both phases. The estimators given here are valid only for that [[Sampling design|sampling design]]. If other sampling designs are used, or different designs in the two phases, the corresponding estimators must be searched for or developed.
-<br>
 *Double sampling can either be carried out with dependent phases or with independent phases. Dependent  phases are there, when the second phase sample is a sub-sample of the first phase sample. That is: a sub-set of [[Random selection|randomly selected]] samples of the first phase is re-visited and in addition to the ancillary variable the target variable is observed. In the case of independent phases, the second phase sample has nothing to do with what had been sampled in the first phase. In that case the ancillary variable has also newly to be observed.
-<br>
+*The idea of two-phase sampling as presented in this chapter can also be extended to more than two phases. However, the more phases, the more complex the estimators.
-*Do not confuse two-phase sampling with two-stage sampling. It is a completely different concept that bases on the subdivision of the population in primary and secondary units.
+{{info
+|message=Important:
+|text=
+Do not confuse two-phase sampling with two-stage sampling.It is a completely different concept that bases on the subdivision of the population in primary and secondary units.
+}}
-<br>
-*The idea of two-phase sampling as presented in this chapter can also be extended to more than two phases. However, the more phases, the more complex the estimators.
+In addition to [[double sampling with ratio or regression estimator]], there is a third variation of double sampling, some times used in forest inventory: double sampling for stratification.
-</blockquote>
-In addition to double sampling with the ratio estimator and double sampling with the [[The ratio estimator|regression estimator]], there is a third variation of double sampling, some times used in forest inventory: double sampling for stratification.
 ==Double sampling for stratification (DSS)==
 ===General remarks===
-In the article on [[Stratified sampling|stratified random sampling]] it was mentioned, that there are occasions in which it is not possible or too difficult to make a clear delimitation of strata before sampling. In those cases, a so-called post-stratification can be done, or the [[Stratified sampling|stratification]] is integrated into the sampling process. And this exactly what double sampling for stratification does: in the first phase, a relatively large sample is taken and the only variable observed is to which stratum the samples belong – whatever the criteria are that are to be used for stratification. The first phase, therefore, serves to estimate the strata sizes. We may say that in the first phase per sample point a categorical variable is observed which can take on ''L'' different values, the number of strata to be distinguished. This is the ancillary variable of the first phase.
+In the article on [[Stratified sampling|stratified random sampling]] it was mentioned, that there are occasions in which it is not possible or too difficult to make a clear delimitation of strata before sampling. In those cases, a so-called post-stratification can be done, or the [[Stratified sampling|stratification]] is integrated into the sampling process. And this is exactly what double sampling for stratification does: in the first phase, a relatively large sample is taken and the only variable observed is to which stratum the samples belong – whatever the criteria are that are to be used for stratification. The first phase, therefore, serves to estimate the strata sizes. We may say that in the first phase per sample point a categorical variable is observed which can take on ''L'' different values, the number of strata to be distinguished. This is the ancillary variable of the first phase.
+In the second phase, a stratified sub-sample is taken from the first phase samples. This is obviously sampling with dependent phases because the value of the ancillary variable is used to guide the second phase stratified sampling. The target variable is then observed on these second phase samples, and estimation is done along the estimators for [[Stratified sampling|stratified sampling]] which must now, obviously, contain further components that account for the estimation error in strata size determination <ref name="kleinn2007" />.
-In the second phase, a stratified sub-sample is taken from the first phase samples. This is obviously sampling with dependent phases because the value of the ancillary variable is used to guide the second phase stratified sampling. The target variable is then observed on these second phase samples, and estimation is done along the estimators for [[Stratified sampling|stratified sampling]] which must now, obviously, contain further components that account for the estimation error in strata size determination.
+In double sampling for stratification, strata sizes need not to be known before sampling starts. In many cases, the number and type of strata are defined; but even that can be done during the first phase analysis process.
-In double sampling for stratification, strata sizes need not to be known before sampling starts. In many cases, the number and type of strata are defined; but even that can be done during the first phase analysis process: if, for example, in an open forest a stratification shall be done according to crown cover one could observe crown cover in the first phase samples and then decide in the analysis process (when the frequency distribution of crown cover values is known) how many strata to distinguish along which crown cover thresholds.
+{{info
+| message=Example
+| text=In an open forest a stratification shall be done according to crown cover one could observe crown cover in the first phase samples and then decide in the analysis process (when the frequency distribution of crown cover values is known) how many strata to distinguish along which crown cover thresholds?
+}}
 ==Notation==
 Notation in double sampling for stratification resembles that for stratified random sampling, but the two phase feature must come in:
-{|
+{|
 |-
 | <math>L\,</math> || Number of Strata;
@@ Line 67: / Line 59: @@
 ==Estimators==
 The relative size of stratum h = the stratum weight as estimated from the first phase, is
-::<math>w'_h = \frac {n'_h}{n'}</math>
+:<math>w'_h = \frac {n'_h}{n'}</math>
 and then the estimated mean of the target variable Y for entire area of interest
-::<math>\bar y = \sum_{h=1}^L w'_h \bar y_h</math>
+:<math>\bar y = \sum_{h=1}^L w'_h \bar y_h</math>
 This estimator corresponds to the estimator in stratified random sampling; the only notable difference is, that strata weights are also random variables here, that is, the variable weight carries also a sampling error because it is estimated.
@@ Line 82: / Line 73: @@
 The estimated error variance is then
-::<math>v \hat ar(\bar y)=\sum_{h=1}^L \left ({w'_h}^2 * \frac {{s'_h}^2}{n'_h} + w'_h * \frac {(\bar y_h - \bar y)^2}{n'} \right)</math>
+:<math>v \hat ar(\bar y)=\sum_{h=1}^L \left ({w'_h}^2 * \frac {{s'_h}^2}{n'_h} + w'_h * \frac {(\bar y_h - \bar y)^2}{n'} \right)</math>
@@ Line 93: / Line 84: @@
 }}
-<br>
-=Double sampling with ratio or regression estimator=
-<br>
+==References==
-The general procedure for both double sampling with the ratio estimator and for double sampling with the regression estimator is identical and has been outlined yet in the introductory section ‎5.7.1. Contrary to double sampling for stratification where a categorical variable was observed in the first phase, it is usually metric variables that serve as ancillary variables when double sampling with the ratio or regression estimator is being used.
-In the first phase, a sample of size n’ is taken to estimate the mean or total of the ancillary variable X. The sample taken is usually large because measurement of X is cheap, fast and easy. In the second phase, a sample is selected on which both target and ancillary variable are observed; from these pairs of observations, a relationship between the two variables can be established, either a ratio or a regression. The second phase sample is usually small because the observation of Y is usually more expensive, difficult and time consuming. Then, the observations from the first phase are used to estimate the total and mean of the target variable for the entire area of interest.
-In both approaches, dependent or independent phases are possible and the corresponding estimators need to be used.
+<references/>
-==Notations==
-<blockquote>
+[[Category:Sampling design]]
-{|
-|-
-|<math>N\,</math> || Total number of samples in the entire area of interest;
-|-
-|<math>n'\,</math> || Number of samples in the first phase;
-|-
-|<math>n\,</math> || Number of samples in the second phase;
-|-
-|<math>\bar y_{md.r}</math> || Estimated mean of target variable ''Y'' from the ratio estimator for entire area;
-|-
-|<math>\bar y_{md.reg}</math> || Estimated mean of target variable ''Y'' from regression estimator for entire area;
-|-
-|<math>\bar x'</math> || Estimated mean of ancillary variable ''X''in the first phase:
-|-
-|<math>\bar x</math> || Estimated mean of ancillary variable ''X'' in the second phase;
-|-
-|<math>\bar y</math> || Estimated mean of target variable ''Y'' in the second phase;
-|-
-|<math>y_i\,</math> || Observed value of target variable ''Y'';
-|-
-|<math>r\,</math> || Estimated ratio of the ratio estimator
-|-
-|<math>b\,</math> || Estimated slope coefficient of regression estimator;
-|-
-|<math>s_y^2</math> || Estimated variance of the target variable ''Y'';
-|-
-|<math>{s'_x}^2</math> || Estimated variance of ancillary variable ''X'' in the first phase;
-|-
-|<math>s_{xy}\,</math> || Estimated covariance of ''Y'' and ''X'' in the second phase;
-|-
-|<math>\hat  \rho</math> ||Estimated coefficient of correlation of ''Y'' and ''X''.
-|}
-</blockquote>
-==Estimators==
-<br>
-The following estimators are for dependent phases only. For independent phases and detailed description of other estimators, readers should refer to the standard textbooks of sampling for forest inventory or sampling in general, for example Cochran (1977), deVries (1986), Lohr (1999), Gregoire et al. (1993) or Gregoire and Valentine (2007).
-For the ''ratio estimator'', the mean of the target variable is estimated as
-::<math>\bar y_{md.r} = \frac {\bar y}{\bar x} * \bar x' = r\bar x'</math>
-with an estimated variance of the estimated mean of
-::<math>v\hat ar (\bar y_{md.r}) = \frac {s_y^2 + r^2{s'_x}^2 - 2rs_{xy}}{n} + \frac {2rs_{xy} - r^2{s'_x}^2}{n'} - \frac{s_y^2}{N}</math>
-And for the ''regression estimator'', the mean is estimated as
-::<math>\bar y_{md.reg} = \bar y + b(\bar x' - \bar x)</math>
-with an estimated variance of the estimated mean of
-::<math>v\hat ar(\bar y_{md.reg}) = \frac {s_y^2}{n} \left \{ 1 - \frac {n' - n}{n'} \hat \rho^2 \right \} </math>
-where
-::<math>s_y^2 = \frac {\sum_{i=1}^n (y_i - \bar y)^2}{n-1}</math>
-for both cases the error variance of the total is calculated as usual as
-::<math>v\hat ar(\hat \tau) = N^2 v\hat ar(\bar y)</math>
-==Overall efficiency==
-<br>
-Overall efficiency depends on the cost relation between observing phase 1 and phase 2 samples and on the correlation between the two variables. In fact, we strive to exploit the ancillary variable as much as possible to be able to reduce the number of (costly) second phase samples. In the forest inventory textbook of Shiver and Borders (1996) there is an instructive table which illustrates this relationship (see Table 20).
-<blockquote>
-{|
-| align="left |'''Table 20.''' The cost relation and the correlation  determine the  optimal ratio of first and second phase samples. Here an  example for  double sampling with regression estimation and dependent  phases (from  Shiver and Borders 1996) is given: the figures are the %  value of first  phase samples that is to be taken also as second phase  sample.
-{| cellspacing="0" border="1" cellpadding="5"
-|-
-|Relative Cost
-|colspan="6" align="center" |Correlation coefficient (''r'')
-|-
-! width="100pt" |<math>C_{n'}:C_n</math>
-! width="100pt" |0,5
-! width="100pt" |0,6
-! width="100pt" |0,7
-! width="100pt" |0,8
-! width="100pt" |0,9
-! width="100pt" |0,95
-|-
-|align="center" |1:5
-|align="center"| 77
-|align="center"| 60
-|align="center"| 46
-|align="center"| 36
-|align="center"| 22
-|align="center"| 15
-|-
-|align="center" |1:10
-|align="center" |55
-|align="center" |42
-|align="center" |32
-|align="center" |24
-|align="center" |15
-|align="center" |10
-|-
-|align="center" |1:15
-|align="center" |45
-|align="center" |34
-|align="center" |26
-|align="center" |19
-|align="center" |13
-|align="center" |8
-|-
-|align="center" |1:20
-|align="center" |39
-|align="center" |30
-|align="center" |23
-|align="center" |17
-|align="center" |11
-|align="center" |7
-|-
-|align="center" |1:30
-|align="center" |32
-|align="center" |24
-|align="center" |19
-|align="center" |14
-|align="center" |9
-|align="center" |6
-|-
-|align="center" |1:50
-|align="center" |24
-|align="center" |19
-|align="center" |14
-|align="center" |11
-|align="center" |7
-|align="center" |5
-|-
-|align="center" |1:100
-|align="center" |17
-|align="center" |13
-|align="center" |10
-|align="center" |7
-|align="center" |5
-|align="center" |3
-|}
-|}
-</blockquote>
-Two basic features are noticed in Table 20: (1) the more expensive the second phase samples in relation to the first phase sample, for a given correlation coefficient, the more samples are in the first phases; and (2) for given cost relation, less second phase samples need to be taken when the correlation is higher.
-As a consequence, similar to what we said for the ratio estimator: if one makes it to identify an ancillary variable which is well correlated to the target variable, one can save cost and possibly gain precision at the same time.
-{{Exercise
-|message=Double sampling with ratio or regression estimator examples
-|text=Examples of application
-}}