Intracluster Correlation Coefficient

From AWF-Wiki

The similarity of observations within a cluster can be quantified by means of the Intracluster Correlation Coefficient (ICC), sometimes also referred to as the intraclass correlation coefficient. It is very similar to the well-known Pearson correlation coefficient, except that we do not look at observations of two variables on the same object but at two values of the same variable taken on two different objects (here, two elements \(p\) and \(q\) of the same cluster). Like the Pearson correlation coefficient, the parametric intra-cluster correlation coefficient is denoted by the Greek letter \(\bar \rho_{ICC}\) and its sample-based estimate by the Latin letter \(r_{ICC}\). It is calculated as

\[\bar \rho_{ICC}=\frac{cov(y_p,y_q)}{\sqrt{var(y_p)} \sqrt{var(y_q)}}=\frac{cov(y_p,y_q)}{var(y)}\]

In the case of the intra-cluster correlation coefficient, \(y_p\) and \(y_q\) are values of one and the same variable, so that \(var(y_p)=var(y_q)\) and the product \(\sqrt{var(y_p)} \sqrt{var(y_q)}\) simplifies to \(var(y)\).

For clusters of equal size \(m\) (the case dealt with in this chapter), the ICC lies in the range \(- \frac{1}{m-1} \le \bar \rho_{ICC} \le +1\). The upper limit is fixed, but the lower limit depends on the cluster size, that is, on the number \(m\) of elements (sub-plots) that are combined into one cluster plot. The larger the number of sub-plots, the closer the negative lower limit is to 0. With some rearranging of the error variance formula (not presented here), the intra-cluster correlation coefficient can be incorporated. The error variance of the estimated mean is then

\[var(\bar y)=\frac{N-n}{N-1} \frac{1}{m} \frac{1}{n} \sigma^2 \left( 1+(m-1)\bar \rho_{ICC} \right)\]

Note that the above formula is the parametric formula for the error variance of the estimated mean per element. It therefore includes the finite population correction, and both the intra-cluster correlation coefficient \(\bar \rho_{ICC}\) and the population variance \(\sigma^2\) appear as parametric values.
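The formula above can be evaluated directly. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and it assumes that \(N\) is the number of clusters in the population and \(n\) the number of sampled clusters, which the text does not state explicitly.

```python
def design_effect(m, rho_icc):
    """Term in parentheses of the error variance formula: 1 + (m - 1) * rho."""
    return 1.0 + (m - 1) * rho_icc


def cluster_mean_variance(N, n, m, sigma2, rho_icc):
    """Parametric error variance of the estimated mean per element.

    N       -- number of clusters in the population (assumed meaning of N)
    n       -- number of sampled clusters (assumed meaning of n)
    m       -- number of sub-plots per cluster (equal cluster size)
    sigma2  -- parametric variance per element in the population
    rho_icc -- parametric intra-cluster correlation coefficient
    """
    if m < 2:
        raise ValueError("a cluster needs at least two sub-plots")
    lower = -1.0 / (m - 1)
    if not lower <= rho_icc <= 1.0:
        raise ValueError(f"rho_icc must lie in [{lower:.4f}, 1]")
    fpc = (N - n) / (N - 1)  # finite population correction
    return fpc * (1.0 / m) * (1.0 / n) * sigma2 * design_effect(m, rho_icc)
```

With `rho_icc=0.0` the function returns the finite population correction times `sigma2 / (n * m)`, that is, the expression for simple random sampling with `n * m` sub-plots.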

Note:
This error variance formula is very instructive for understanding and analyzing the performance of cluster sampling in populations with different covariance structures, because the covariance structure of the population (the spatial autocorrelation) is directly mirrored in the intra-cluster correlation coefficient.

Let's look at that error variance formula for different situations of spatial autocorrelation, that is, for different values of the intra-cluster correlation coefficient \(\bar \rho_{ICC}\).

  1. If \(\bar \rho_{ICC}=0\), the sub-plots are so distant from each other that no correlation is present. The term in parentheses then becomes 1, and the error variance is exactly that of simple random sampling with sample size \(nm\). In that case (which is very uncommon in forest inventory applications), cluster sampling and simple random sampling with the same number of sub-plots are identical in terms of statistical precision. We may use this situation as a reference for the following two cases.
  2. When \(\bar \rho_{ICC}<0\), the term in parentheses becomes smaller than 1 and the resulting error variance is smaller than under simple random sampling. While this would be very welcome, negative intra-cluster correlation coefficients are very uncommon in forest inventory.
  3. The usual case in forest inventory is \(\bar \rho_{ICC}>0\), which means that cluster sampling carries a larger error variance than simple random sampling with the same number of sub-plots. The planner therefore strives to keep the intra-cluster correlation as small as possible in order not to lose too much precision.
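
As a worked illustration of these cases, assume clusters of \(m=4\) sub-plots. The values of \(\bar \rho_{ICC}\) below are chosen for illustration only and are not taken from any inventory. The term in parentheses then evaluates to

\[\begin{aligned}
\bar \rho_{ICC}=0: &\quad 1+(4-1)\cdot 0 = 1 \\
\bar \rho_{ICC}=-0.1: &\quad 1+(4-1)\cdot(-0.1) = 0.7 \\
\bar \rho_{ICC}=0.3: &\quad 1+(4-1)\cdot 0.3 = 1.9 \\
\bar \rho_{ICC}=-\tfrac{1}{3}: &\quad 1+(4-1)\cdot\left(-\tfrac{1}{3}\right) = 0
\end{aligned}\]

With \(\bar \rho_{ICC}=0.3\), the error variance is therefore 1.9 times that of simple random sampling with the same number of sub-plots. The last line shows where the lower limit \(-\frac{1}{m-1}\) comes from: at that value the error variance is zero, and any smaller value would make it negative.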

From the cluster plot data of a forest inventory, an empirical estimate of the intra-cluster correlation coefficient can be calculated by combining all pairs of sub-plots. If the cluster design is large and complex enough, it is also possible to derive some information about the spatial autocorrelation by calculating the correlation for all pairs of sub-plots that are a defined distance apart. That is, one correlation value can be calculated for each inter-subplot distance that occurs within the cluster; if enough distinct distances are available, sections of the covariance function can be estimated.
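
The pairwise estimation described above might be sketched as follows. This is one possible estimator under stated assumptions, not necessarily the one used in practice: deviations are taken from the overall mean, the variance is computed over all sub-plot values with divisor equal to their number, all clusters have the same sub-plot layout, and the function names and the `positions` argument (sub-plot coordinates within the cluster) are hypothetical.

```python
from itertools import combinations
from math import hypot


def _deviations(clusters):
    """Deviations from the overall mean, and the variance over all values."""
    values = [v for cluster in clusters for v in cluster]
    mean = sum(values) / len(values)
    dev = [[v - mean for v in cluster] for cluster in clusters]
    var = sum(d * d for cluster in dev for d in cluster) / len(values)
    return dev, var


def icc_from_pairs(clusters):
    """Estimate r_ICC from all pairs of sub-plots within each cluster."""
    dev, var = _deviations(clusters)
    products = [c[p] * c[q]
                for c in dev
                for p, q in combinations(range(len(c)), 2)]
    return sum(products) / len(products) / var


def icc_by_distance(clusters, positions, ndigits=1):
    """One correlation value per inter-subplot distance within the cluster.

    positions -- (x, y) coordinates of the sub-plots, same order as the values
    """
    dev, var = _deviations(clusters)
    groups = {}
    for p, q in combinations(range(len(positions)), 2):
        (xp, yp), (xq, yq) = positions[p], positions[q]
        distance = round(hypot(xp - xq, yp - yq), ndigits)
        groups.setdefault(distance, []).extend(c[p] * c[q] for c in dev)
    return {d: sum(v) / len(v) / var for d, v in sorted(groups.items())}
```

Each row of `clusters` holds the observations of one cluster plot. Both functions divide by the overall variance, so they fail if all observations are identical.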

Cluster design planning criteria

In the preceding chapter we have seen that there are various criteria that should be taken into account when designing cluster plots for a forest inventory (or any other sampling study). As always, cost and practical criteria need to be balanced against statistical performance. As with plot design planning in general, the overall objective is to design the plots such that, for a given cost, the highest possible within-plot variability is achieved for the most highly prioritized target variables. The following basic “factors” need to be defined and optimized in cluster design planning for large area forest inventories, where clusters consist of a number \(m\) of sub-plots:

  • number of sub-plots;
  • type of sub-plot;
  • size of sub-plot;
  • geometrical arrangement of sub-plots;
  • distance between sub-plots.

The distance between sub-plots is either defined empirically, or one tries to derive the spatial autocorrelation function and then defines the inter-plot distance in such a manner that the inter-plot correlation is small. The larger the distance between neighboring sub-plots, the smaller the correlation usually is. However, very large distances are often impractical in the field because of the long (and essentially unproductive) walking time, so that a certain level of correlation is accepted. In many large area forest inventories, the cluster design is not only optimized for statistical performance but is also intended to allow one cluster to be completed within one day. Returning to a cluster plot a second time is usually too inefficient, because travel distances in large area inventories are usually large.
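
The trade-off described above could be explored with a small helper like the one below. It is only a sketch: both functions and all their parameters are hypothetical, the walking distance assumes the sub-plots are visited one after another along a line, and any correlation values passed in would have to come from an actual estimate of the autocorrelation function.

```python
def smallest_acceptable_distance(corr_by_distance, max_corr):
    """Smallest inter-plot distance whose correlation does not exceed max_corr.

    corr_by_distance -- mapping {distance: correlation}
    Returns None if no listed distance is acceptable.
    """
    for distance in sorted(corr_by_distance):
        if corr_by_distance[distance] <= max_corr:
            return distance
    return None


def fits_in_one_day(m, distance_m, walk_speed_m_per_h,
                    hours_per_subplot, access_hours, day_hours):
    """Check whether one cluster can be finished within one working day.

    Assumes the m sub-plots are visited in sequence, so the walking
    distance inside the cluster is (m - 1) * distance_m.
    """
    walking_hours = (m - 1) * distance_m / walk_speed_m_per_h
    total_hours = m * hours_per_subplot + walking_hours + access_hours
    return total_hours <= day_hours
```

A planner might first look up the smallest distance that brings the correlation below an accepted level and then check whether a cluster with that spacing can still be completed in one day.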
