Double sampling

From AWF-Wiki

Revision as of 15:58, 12 January 2011

Forest Inventory lecturenotes

Introduction

For the ratio estimator and the regression estimator we stipulated that the parametric mean or the true total of the ancillary variable needs to be known in order to apply those estimators. In some cases this is an inconvenient requirement, because the population values may not be known. A way out is to estimate these values as well. This is exactly what double sampling, also referred to as two-phase sampling, is about: in a first phase, the ancillary variable is estimated, usually from a relatively large sample of a variable that is relatively easy and inexpensive to observe. Then, in a second phase, a smaller sample is taken of the target variable, which is frequently much more expensive or difficult to observe. The ancillary variable is observed on the second phase samples as well, so that a relationship between target and ancillary variable can be established (either a ratio in the case of double sampling with the ratio estimator or a regression in the case of double sampling with the regression estimator). The correlation with the ancillary variable is thus also used to reduce the sample size in the second phase.

Observe:

  • Here, we deal with double sampling, with simple random sampling in both phases. The estimators given here are valid only for that sampling design. If other sampling designs are used, or different designs in the two phases, the corresponding estimators must be searched for or developed.


  • Double sampling can be carried out either with dependent phases or with independent phases. The phases are dependent when the second phase sample is a sub-sample of the first phase sample. That is, a randomly selected sub-set of the first phase samples is re-visited and the target variable is observed in addition to the ancillary variable. In the case of independent phases, the second phase sample is selected independently of what was sampled in the first phase. In that case the ancillary variable also has to be observed anew.


  • Do not confuse two-phase sampling with two-stage sampling. The latter is a completely different concept that is based on the subdivision of the population into primary and secondary units.


  • The idea of two-phase sampling as presented in this chapter can also be extended to more than two phases. However, the more phases, the more complex the estimators.
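
The difference between dependent and independent phases can be illustrated with a minimal Python sketch. It assumes simple random sampling without replacement in both phases; the function name `draw_two_phase_sample` and the list of unit identifiers are hypothetical and only serve the illustration.

```python
import random


def draw_two_phase_sample(unit_ids, n_first, n_second, dependent=True, seed=None):
    """Select first and second phase samples by simple random sampling.

    dependent=True : the second phase is a sub-sample of the first phase.
    dependent=False: the second phase is drawn anew from the whole population.
    """
    rng = random.Random(seed)
    first = rng.sample(unit_ids, n_first)
    # With dependent phases the sampling frame of phase 2 is the phase 1 sample.
    frame = first if dependent else unit_ids
    second = rng.sample(frame, n_second)
    return first, second


# Hypothetical population of 1000 sampling units.
units = list(range(1000))
first, second = draw_two_phase_sample(units, 200, 40, dependent=True, seed=1)
```

With `dependent=True` every second phase unit already carries a first phase observation of the ancillary variable; with `dependent=False` the ancillary variable has to be observed again on the second phase units.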


In addition to double sampling with the ratio estimator and double sampling with the regression estimator, there is a third variation of double sampling that is sometimes used in forest inventory: double sampling for stratification.

Double sampling for stratification (DSS)

General remarks

In the article on stratified random sampling it was mentioned that there are occasions in which it is not possible, or too difficult, to delimit the strata clearly before sampling. In those cases, a so-called post-stratification can be done, or the stratification is integrated into the sampling process. This is exactly what double sampling for stratification does: in the first phase, a relatively large sample is taken and the only variable observed is the stratum to which each sample belongs, whatever criteria are used for stratification. The first phase, therefore, serves to estimate the strata sizes. We may say that in the first phase a categorical variable is observed per sample point, which can take on L different values, where L is the number of strata to be distinguished. This is the ancillary variable of the first phase.

In the second phase, a stratified sub-sample is taken from the first phase samples. This is obviously sampling with dependent phases, because the value of the ancillary variable is used to guide the second phase stratified sampling. The target variable is then observed on these second phase samples, and estimation follows the estimators for stratified sampling, which must now contain further components that account for the estimation error in the strata sizes.

In double sampling for stratification, strata sizes need not be known before sampling starts. In many cases, the number and type of strata are defined beforehand; but even that can be done during the analysis of the first phase. If, for example, an open forest is to be stratified according to crown cover, one could observe crown cover on the first phase samples and then decide during the analysis (when the frequency distribution of crown cover values is known) how many strata to distinguish and which crown cover thresholds to use.


Notation

Notation in double sampling for stratification resembles that for stratified random sampling, but it must also reflect the two-phase design:

\(L\,\) Number of strata;
\(n'\,\) Total number of samples in the first phase;
\(n'_{h}\,\) Number of first phase samples in stratum h;
\( w'_{h}\,\) Weight of stratum h, as estimated from the first phase;
\( \bar y_h\) Estimated mean of target variable Y in stratum h;
\( \bar y\) Estimated mean of the target variable Y for the entire area of interest;
\(s^2_{h}\) Estimated variance of the target variable Y within the \(h^{th}\) stratum.

Estimators

The relative size of stratum h, that is, the stratum weight as estimated from the first phase, is

\[w'_h = \frac {n'_h}{n'}\]


and the estimated mean of the target variable Y for the entire area of interest is then

\[\bar y = \sum_{h=1}^L w'_h \bar y_h\]

This estimator corresponds to the estimator in stratified random sampling; the only notable difference is that the strata weights are also random variables here, that is, each weight carries a sampling error because it is estimated.

The estimated error variance is then

\[v\hat ar(\bar y)=\sum_{h=1}^L \left ({w'_h}^2 \frac {s_h^2}{n_h} + w'_h \frac {(\bar y_h - \bar y)^2}{n'} \right)\]


where \(n_h\) is the number of second phase samples in stratum h, on which \(\bar y_h\) and \(s_h^2\) are computed. Here we neglect the finite population correction, assuming that we deal with large populations and samples that are small relative to the population size. The first term in parentheses is known from error variance estimation for stratified random sampling. The second term is new and comes in because the strata sizes are only estimated; it is easy to understand that the error variance must be greater when the stratum sizes are estimated rather than known.
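
The estimators for double sampling for stratification can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes that the within-stratum term uses the variance and the number of the second phase observations in each stratum (as in the stratified random sampling estimator) and that the finite population correction is neglected. The function name `dss_estimate` and the example figures are hypothetical.

```python
from statistics import mean, variance


def dss_estimate(first_phase_counts, second_phase_obs):
    """Mean and error variance for double sampling for stratification.

    first_phase_counts: {stratum: n'_h}, first phase samples per stratum
    second_phase_obs  : {stratum: [y values]}, second phase observations
                        (at least two per stratum)
    """
    n_first = sum(first_phase_counts.values())            # n'
    weights = {h: c / n_first for h, c in first_phase_counts.items()}  # w'_h
    means = {h: mean(y) for h, y in second_phase_obs.items()}          # y_bar_h
    y_bar = sum(weights[h] * means[h] for h in weights)

    var = 0.0
    for h in weights:
        y = second_phase_obs[h]
        # Term known from stratified random sampling (assumed: n_h = len(y)).
        within = weights[h] ** 2 * variance(y) / len(y)
        # Additional term because the stratum weights are only estimated.
        between = weights[h] * (means[h] - y_bar) ** 2 / n_first
        var += within + between
    return y_bar, var


# Invented example: 100 first phase points in two strata, three plots each.
counts = {"A": 60, "B": 40}
obs = {"A": [10, 12, 14], "B": [20, 22, 24]}
y_bar, var_y_bar = dss_estimate(counts, obs)
```

In this invented example the weights are 0.6 and 0.4, so the estimated mean is 0.6 · 12 + 0.4 · 22 = 16.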


Double sampling examples: Examples of application


Double sampling with ratio or regression estimator


The general procedure is identical for double sampling with the ratio estimator and for double sampling with the regression estimator, and has already been outlined in the introductory section 5.7.1. In contrast to double sampling for stratification, where a categorical variable is observed in the first phase, it is usually metric variables that serve as ancillary variables when double sampling with the ratio or regression estimator is used.

In the first phase, a sample of size \(n'\) is taken to estimate the mean or total of the ancillary variable X. This sample is usually large because measurement of X is cheap, fast and easy. In the second phase, a sample is selected on which both the target and the ancillary variable are observed; from these pairs of observations, a relationship between the two variables can be established, either a ratio or a regression. The second phase sample is usually small because the observation of Y is usually more expensive, difficult and time-consuming. This relationship is then combined with the observations from the first phase to estimate the total and mean of the target variable for the entire area of interest.

In both approaches, dependent or independent phases are possible and the corresponding estimators need to be used.

Notations

\(N\,\) Total number of sampling units in the entire area of interest (population size);
\(n'\,\) Number of samples in the first phase;
\(n\,\) Number of samples in the second phase;
\(\bar y_{md.r}\) Estimated mean of target variable Y from the ratio estimator for the entire area;
\(\bar y_{md.reg}\) Estimated mean of target variable Y from the regression estimator for the entire area;
\(\bar x'\) Estimated mean of ancillary variable X in the first phase;
\(\bar x\) Estimated mean of ancillary variable X in the second phase;
\(\bar y\) Estimated mean of target variable Y in the second phase;
\(y_i\,\) Observed value of target variable Y;
\(r\,\) Estimated ratio of the ratio estimator;
\(b\,\) Estimated slope coefficient of the regression estimator;
\(s_y^2\) Estimated variance of the target variable Y;
\({s'_x}^2\) Estimated variance of ancillary variable X in the first phase;
\(s_{xy}\,\) Estimated covariance of Y and X in the second phase;
\(\hat \rho\) Estimated coefficient of correlation of Y and X.


Estimators


The following estimators are for dependent phases only. For independent phases and detailed description of other estimators, readers should refer to the standard textbooks of sampling for forest inventory or sampling in general, for example Cochran (1977[1]), deVries (1986[2]), Lohr (1999[3]), Gregoire et al. (1993) or Gregoire and Valentine (2007).


For the ratio estimator, the mean of the target variable is estimated as


\[\bar y_{md.r} = \frac {\bar y}{\bar x} * \bar x' = r\bar x'\]


with an estimated variance of the estimated mean of


\[v\hat ar (\bar y_{md.r}) = \frac {s_y^2 + r^2{s'_x}^2 - 2rs_{xy}}{n} + \frac {2rs_{xy} - r^2{s'_x}^2}{n'} - \frac{s_y^2}{N}\]
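
As an illustration, the ratio estimator for dependent phases and its estimated variance can be sketched in Python as follows. The function name `ratio_double_sampling` and the example data are hypothetical. The sketch assumes that the finite population term \(s_y^2/N\) is simply dropped when the population size `N` is not supplied.

```python
from statistics import mean, variance


def ratio_double_sampling(x_first, x_second, y_second, N=None):
    """Double sampling with the ratio estimator (dependent phases).

    x_first : ancillary variable X on the n' first phase samples
    x_second: X on the n second phase samples
    y_second: target variable Y on the same n second phase samples
    N       : population size; if None, the term s_y^2 / N is omitted
    """
    n_first, n = len(x_first), len(y_second)
    x_bar, y_bar = mean(x_second), mean(y_second)
    r = y_bar / x_bar                        # estimated ratio
    y_hat = r * mean(x_first)                # estimated mean of Y

    s2_y = variance(y_second)                # s_y^2
    s2_x_first = variance(x_first)           # s'_x^2 from the first phase
    s_xy = sum((x - x_bar) * (y - y_bar)
               for x, y in zip(x_second, y_second)) / (n - 1)

    var = ((s2_y + r ** 2 * s2_x_first - 2 * r * s_xy) / n
           + (2 * r * s_xy - r ** 2 * s2_x_first) / n_first)
    if N is not None:
        var -= s2_y / N
    return y_hat, var


# Invented data: 6 first phase samples, 3 of them re-measured for Y.
y_hat_r, var_r = ratio_double_sampling([1, 2, 3, 4, 5, 6], [2, 4, 6], [4, 8, 12])
```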


And for the regression estimator, the mean is estimated as


\[\bar y_{md.reg} = \bar y + b(\bar x' - \bar x)\]


with an estimated variance of the estimated mean of


\[v\hat ar(\bar y_{md.reg}) = \frac {s_y^2}{n} \left \{ 1 - \frac {n' - n}{n'} \hat \rho^2 \right \} \]


where


\[s_y^2 = \frac {\sum_{i=1}^n (y_i - \bar y)^2}{n-1}\]
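
The regression estimator for dependent phases can be sketched in the same way. The formulas for the slope \(b\) and for \(\hat \rho\) are not given in the text; the sketch assumes the usual least-squares slope \(b = s_{xy}/s_x^2\) and \(\hat \rho^2 = s_{xy}^2/(s_x^2 s_y^2)\), both computed from the second phase sample. The function name `regression_double_sampling` and the example data are hypothetical.

```python
from statistics import mean, variance


def regression_double_sampling(x_first, x_second, y_second):
    """Double sampling with the regression estimator (dependent phases)."""
    n_first, n = len(x_first), len(y_second)
    x_bar, y_bar = mean(x_second), mean(y_second)

    s2_x = variance(x_second)                # s_x^2 in the second phase
    s2_y = variance(y_second)                # s_y^2 in the second phase
    s_xy = sum((x - x_bar) * (y - y_bar)
               for x, y in zip(x_second, y_second)) / (n - 1)

    b = s_xy / s2_x                          # assumed: least-squares slope
    rho2 = s_xy ** 2 / (s2_x * s2_y)         # assumed: squared correlation

    y_hat = y_bar + b * (mean(x_first) - x_bar)
    var = s2_y / n * (1 - (n_first - n) / n_first * rho2)
    return y_hat, var


# Invented data: 6 first phase samples, 3 of them re-measured for Y.
y_hat_reg, var_reg = regression_double_sampling(
    [1, 2, 3, 4, 5, 6], [2, 4, 6], [5, 7, 12])
```

If the correlation is zero, the variance reduces to \(s_y^2/n\), the variance of a simple random sample of size n; the higher the correlation and the larger n' relative to n, the greater the reduction.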


For both cases, the error variance of the total is calculated as usual as


\[v\hat ar(\hat \tau) = N^2 v\hat ar(\bar y)\]

Overall efficiency

Overall efficiency depends on the cost relation between observing first phase and second phase samples and on the correlation between the two variables. In fact, we strive to exploit the ancillary variable as much as possible in order to reduce the number of (costly) second phase samples. The forest inventory textbook of Shiver and Borders (1996[4]) contains an instructive table which illustrates this relationship (see Table 1).

Table 1. The cost relation and the correlation determine the optimal ratio of first and second phase samples. The example is for double sampling with regression estimation and dependent phases (from Shiver and Borders 1996 [4]): the figures give the percentage of first phase samples that should also be taken as second phase samples.

Relative cost        Correlation coefficient (r)
\(C_{n'}:C_n\)       0.5    0.6    0.7    0.8    0.9    0.95
1:5                  77     60     46     36     22     15
1:10                 55     42     32     24     15     10
1:15                 45     34     26     19     13     8
1:20                 39     30     23     17     11     7
1:30                 32     24     19     14     9      6
1:50                 24     19     14     11     7      5
1:100                17     13     10     7      5      3


Two basic features can be seen in Table 1: (1) for a given correlation coefficient, the more expensive the second phase samples are relative to the first phase samples, the smaller the share of first phase samples that is also observed in the second phase, that is, relatively more samples are taken in the first phase; and (2) for a given cost relation, fewer second phase samples need to be taken when the correlation is higher.

As a consequence, similar to what we said for the ratio estimator: if one manages to identify an ancillary variable which is well correlated with the target variable, one can save cost and possibly gain precision at the same time.
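
The dependence of the optimal second phase share on cost relation and correlation can be explored with a short Python sketch. The formula used here is not given in the text: it is the commonly cited optimal allocation for double sampling with regression under a linear cost function, \(n/n' = \sqrt{\frac{1-\rho^2}{\rho^2} \cdot \frac{C_{n'}}{C_n}}\), which appears to be consistent with the values in the table (for example 77% for a cost relation of 1:5 and a correlation of 0.5). The function name `optimal_second_phase_fraction` is hypothetical.

```python
from math import sqrt


def optimal_second_phase_fraction(rho, cost_first, cost_second):
    """Share of first phase samples to be observed also in the second phase.

    Assumed formula: n / n' = sqrt((1 - rho^2) / rho^2 * C_n' / C_n),
    capped at 1 because the second phase cannot exceed the first phase.
    """
    fraction = sqrt((1 - rho ** 2) / rho ** 2 * cost_first / cost_second)
    return min(1.0, fraction)


# Percentage for a cost relation of 1:5 and a correlation of 0.5.
pct_a = round(100 * optimal_second_phase_fraction(0.5, 1, 5))
# Percentage for a cost relation of 1:10 and a correlation of 0.9.
pct_b = round(100 * optimal_second_phase_fraction(0.9, 1, 10))
```

Both features described above follow directly from the formula: the fraction decreases when the second phase becomes more expensive relative to the first, and it decreases when the correlation increases.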


Double sampling with ratio or regression estimator examples: Examples of application

References

  1. Cochran WG. 1977. Sampling Techniques. John Wiley & Sons. 428p.
  2. de Vries PG. 1986. Sampling Theory for Forest Inventory. Springer-Verlag Berlin. 399p.
  3. Lohr S. 1999. Sampling Design and Analysis. Brooks/Cole Publishing Company. 494p.
  4. Shiver BD and BE Borders. 1996. Sampling Techniques for Forest Resource Inventory. John Wiley & Sons. 356p.