Double sampling with ratio or regression estimator

From AWF-Wiki
Latest revision as of 08:09, 31 October 2013

The general procedure for both double sampling with the ratio estimator and for double sampling with the regression estimator is identical. Contrary to double sampling for stratification where a categorical variable is observed in the first phase, it is usually metric variables that serve as ancillary variables when double sampling with the ratio or regression estimator is being used.

In the first phase, a sample of size \(n'\) is taken to estimate the mean or total of the ancillary variable X. This sample is usually large because measuring X is cheap, fast and easy. In the second phase, a sample of size \(n\) is selected on which both the target variable Y and the ancillary variable X are observed; from these pairs of observations, a relationship between the two variables is established, either a ratio or a regression. The second-phase sample is usually small because observing Y is usually more expensive, difficult and time-consuming. Finally, this relationship is applied to the first-phase observations of X to estimate the mean and total of the target variable for the entire area of interest.

In both approaches, dependent or independent phases are possible and the corresponding estimators need to be used [1]. Double sampling is also of interest in the context of sampling with partial replacement (SPR), which is a very efficient technique for estimating changes.


Notations

\(N\,\) Total number of sampling units (population size) in the entire area of interest;
\(n'\,\) Number of samples in the first phase;
\(n\,\) Number of samples in the second phase;
\(\bar y_{md.r}\) Estimated mean of target variable Y from the ratio estimator for entire area;
\(\bar y_{md.reg}\) Estimated mean of target variable Y from regression estimator for entire area;
\(\bar x'\) Estimated mean of ancillary variable X in the first phase;
\(\bar x\) Estimated mean of ancillary variable X in the second phase;
\(\bar y\) Estimated mean of target variable Y in the second phase;
\(y_i\,\) Observed value of target variable Y;
\(r\,\) Estimated ratio of the ratio estimator;
\(b\,\) Estimated slope coefficient of regression estimator;
\(s_y^2\) Estimated variance of the target variable Y in the second phase;
\({s'_x}^2\) Estimated variance of ancillary variable X in the first phase;
\(s_{xy}\,\) Estimated covariance of Y and X in the second phase;
\(\hat \rho\) Estimated coefficient of correlation of Y and X.


Estimators

The following estimators are for dependent phases only. For independent phases and a detailed description of other estimators, readers should refer to the standard textbooks on sampling for forest inventory or on sampling in general, for example Cochran (1977[2]), de Vries (1986[3]), Lohr (1999[4]), Gregoire et al. (1993) or Gregoire and Valentine (2007).

For the ratio estimator, the mean of the target variable is estimated as

\[\bar y_{md.r} = \frac {\bar y}{\bar x} \cdot \bar x' = r\bar x'\]

with an estimated variance of the estimated mean of

\[v\hat ar (\bar y_{md.r}) = \frac {s_y^2 + r^2{s'_x}^2 - 2rs_{xy}}{n} + \frac {2rs_{xy} - r^2{s'_x}^2}{n'} - \frac{s_y^2}{N}\]
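
The two ratio-estimator formulas above can be sketched in code as follows. This is a minimal illustration, not part of the original notes: the function and argument names (`ratio_double_sampling`, `x_first`, `x_second`, `y_second`) are hypothetical, dependent phases are assumed (the second-phase units are a subsample of the first-phase units), and all sample variances and covariances use the divisor \(n-1\).

```python
from statistics import mean, variance


def sample_covariance(x, y):
    """Sample covariance of paired observations (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def ratio_double_sampling(x_first, x_second, y_second, N):
    """Ratio estimator for double sampling with dependent phases.

    x_first  : X observed on all n' first-phase units
    x_second : X observed on the n second-phase units
    y_second : Y observed on the same n second-phase units
    N        : population size
    Returns (estimated mean of Y, estimated variance of that mean).
    """
    n_first, n = len(x_first), len(y_second)
    r = mean(y_second) / mean(x_second)      # estimated ratio r
    y_mean = r * mean(x_first)               # r * x-bar'
    s2_y = variance(y_second)                # s_y^2 (second phase)
    s2_x_first = variance(x_first)           # s'_x^2 (first phase)
    s_xy = sample_covariance(x_second, y_second)
    var_mean = (
        (s2_y + r**2 * s2_x_first - 2 * r * s_xy) / n
        + (2 * r * s_xy - r**2 * s2_x_first) / n_first
        - s2_y / N
    )
    return y_mean, var_mean
```

A call such as `ratio_double_sampling(x_first, x_second, y_second, N=1000)` returns the estimated mean and its estimated variance; the input lists here stand for whatever field observations are available.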

And for the regression estimator, the mean is estimated as

\[\bar y_{md.reg} = \bar y + b(\bar x' - \bar x)\]

with an estimated variance of the estimated mean of

\[v\hat ar(\bar y_{md.reg}) = \frac {s_y^2}{n} \left \{ 1 - \frac {n' - n}{n'} \hat \rho^2 \right \} \]

where

\[s_y^2 = \frac {\sum_{i=1}^n (y_i - \bar y)^2}{n-1}\]

In both cases, the error variance of the estimated total is calculated as usual as

\[v\hat ar(\hat \tau) = N^2 v\hat ar(\bar y)\]
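
As a rough sketch of the regression estimator and the expansion to the total given above: the names below (`regression_double_sampling`, `expand_to_total`) are hypothetical, dependent phases are assumed, and the slope \(b\) is taken to be the ordinary least-squares slope from the second-phase pairs, which is an assumption because the text does not state how \(b\) is computed.

```python
from statistics import mean, variance


def regression_double_sampling(x_first, x_second, y_second):
    """Regression estimator for double sampling with dependent phases.

    Returns (estimated mean of Y, estimated variance of that mean).
    """
    n_first, n = len(x_first), len(y_second)
    mx, my = mean(x_second), mean(y_second)
    s_xy = sum((a - mx) * (b - my)
               for a, b in zip(x_second, y_second)) / (n - 1)
    s2_x = variance(x_second)
    s2_y = variance(y_second)
    b = s_xy / s2_x                      # assumed: least-squares slope
    rho2 = s_xy**2 / (s2_x * s2_y)       # squared estimated correlation
    y_mean = my + b * (mean(x_first) - mx)
    var_mean = s2_y / n * (1 - (n_first - n) / n_first * rho2)
    return y_mean, var_mean


def expand_to_total(y_mean, var_mean, N):
    """Estimated total N * mean, with error variance N^2 * var(mean)."""
    return N * y_mean, N**2 * var_mean
```

The helper `expand_to_total` applies equally to the result of the ratio estimator, since the variance of the total is obtained from the variance of the mean in the same way in both cases.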

Overall efficiency

Overall efficiency depends on the cost relation between observing first-phase and second-phase samples and on the correlation between the two variables. The aim is to exploit the ancillary variable as much as possible in order to reduce the number of (costly) second-phase samples. The forest inventory textbook of Shiver and Borders (1996[5]) contains an instructive table which illustrates this relationship (see Table 1).


Table 1. The cost relation and the correlation determine the optimal ratio of first and second phase samples. The example is for double sampling with regression estimation and dependent phases (from Shiver and Borders 1996 [5]): each figure is the percentage of first-phase samples that should also be observed as second-phase samples.

Relative cost       Correlation coefficient (\(\rho\))
\(C_{n'}:C_n\)      0.5    0.6    0.7    0.8    0.9    0.95
1:5                 77     60     46     36     22     15
1:10                55     42     32     24     15     10
1:15                45     34     26     19     13     8
1:20                39     30     23     17     11     7
1:30                32     24     19     14     9      6
1:50                24     19     14     11     7      5
1:100               17     13     10     7      5      3


Two basic features can be seen in Table 1: (1) for a given correlation coefficient, the more expensive the second-phase samples are relative to the first-phase samples, the smaller the share of first-phase samples that is also observed in the second phase; and (2) for a given cost relation, fewer second-phase samples need to be taken when the correlation is higher.
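
The pattern in Table 1 can be explored numerically. The sketch below assumes the classical optimum-allocation result for double sampling with regression, \(n/n' = \sqrt{(C_{n'}/C_n)(1-\rho^2)/\rho^2}\); this formula is not given in the text above and is brought in here as an assumption. It reproduces nearly all cells of Table 1 to within rounding (the cell for 1:5 and 0.8 comes out at about 34 rather than 36). The function name is hypothetical.

```python
from math import sqrt


def optimal_second_phase_fraction(cost_ratio, rho):
    """Optimal fraction n / n' of first-phase units to measure again.

    cost_ratio : cost of a second-phase unit divided by the cost of a
                 first-phase unit (e.g. 5 for the row "1:5")
    rho        : anticipated correlation between Y and X
    Assumes the classical allocation formula named in the lead-in.
    """
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    if not 0 < abs(rho) < 1:
        raise ValueError("rho must satisfy 0 < |rho| < 1")
    fraction = sqrt((1 - rho**2) / (rho**2 * cost_ratio))
    # The second phase cannot contain more units than the first phase.
    return min(1.0, fraction)


# Print the fraction in percent for one row of the table (cost 1:10).
for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(rho, round(100 * optimal_second_phase_fraction(10, rho)))
```

Both features noted above follow directly from the formula: the fraction shrinks as the cost ratio grows and as the correlation approaches 1.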

As a consequence, and similar to what can be said for the ratio estimator: if one manages to identify an ancillary variable which is well correlated with the target variable, one can save cost and possibly gain precision at the same time.


Exercise: Double sampling with ratio or regression estimator examples: Examples of application

References

  1. Kleinn C. 2007. Lecture Notes for the Teaching Module Forest Inventory. Department of Forest Inventory and Remote Sensing, Faculty of Forest Science and Forest Ecology, Georg-August-Universität Göttingen. 164p.
  2. Cochran WG. 1977. Sampling Techniques. John Wiley & Sons. 428p.
  3. de Vries PG. 1986. Sampling Theory for Forest Inventory. Springer-Verlag Berlin. 399p.
  4. Lohr S. 1999. Sampling Design and Analysis. Brooks/Cole Publishing Company. 494p.
  5. Shiver BD and BE Borders. 1996. Sampling Techniques for Forest Resource Inventory. John Wiley & Sons. 356p.