Resource assessment exercises: mean, variance and standard deviation
Formally, the population (i.e., the ten trees) will be denoted by $U$ and consists of $N=10$ elements,
\[U=\{\text{tree}_1,\text{tree}_2,\ldots,\text{tree}_i,\ldots,\text{tree}_N\}\]
For simplicity we will let the $i$th tree be represented by its label $i$. Thus, our finite population can be written as
\[U=\{1,2,\ldots,i,\ldots,N\}\]
Attached to each tree is the value of a study variable $y$, e.g., the DBH. The value of the study variable for the $i$th element will be denoted $y_i$. The population mean of $y$ is defined as
$\mu_y=\frac{1}{N}\sum_{i=1}^N y_i$ | 1 |
Here, our study, or target variable is the DBH. We calculate the population mean of the variable dbh in R:
mean(trees10$dbh)
## [1] 26.5
- What the function mean() does: The function mean(x) computes the mean of a vector x (see equation 1). The $ in the last code example is used to access a single column of a data.frame.
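As a minimal sketch, equation 1 can be written out by hand and compared with mean(). The vector y_demo below is made-up illustration data and not the trees10 data set used in this article.

```r
# Hypothetical DBH values (cm) for a population of N = 10 trees;
# these are NOT the values stored in trees10.
y_demo <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39)

N <- length(y_demo)           # population size
mu_manual <- sum(y_demo) / N  # equation 1 written out

mu_manual     # population mean computed by hand
mean(y_demo)  # built-in function, gives the same value
```

Both lines return the same number, since mean() computes exactly the sum of the values divided by their count.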
This is the population mean, also called the parametric mean. A parameter can only be calculated when we conduct a census, that is, when we have access to all values $y_i$ in the population $U$.
The variance, i.e., the average squared deviation of the individual values $y_i$ from their mean, is defined as
$\sigma^2=\frac{\sum_{i=1}^N(y_i-\mu_y)^2}{N}$ | 2 |
In R:
pvar(trees10$dbh)
## [1] 136.2
- What the function pvar() does: The function pvar(x) computes the parametric variance of a vector x (see equation 2). Note that pvar() is not a base R function. R's standard variance function var() divides by the number of values minus one (here $N-1$). We divide by $N$ instead because the mean is not estimated, i.e., it is not a random variable. The definition of the function pvar() is given in the section Parametric variance.
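The relation between var() and the parametric variance can be checked numerically: multiplying var(x) by $(N-1)/N$ gives the same result as dividing by $N$. The sketch below uses the made-up vector y_demo (not the trees10 data) and restates pvar() so that it runs on its own.

```r
# Parametric variance, same definition as in the section "Parametric variance"
pvar <- function(x) {
  sum((x - mean(x))^2) / length(x)
}

# Hypothetical DBH values (cm), NOT the trees10 data
y_demo <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39)
N <- length(y_demo)

pvar(y_demo)               # divides by N
var(y_demo) * (N - 1) / N  # var() divides by N - 1; rescaled, it matches pvar()
```

For small $N$ the two denominators give noticeably different values. The difference shrinks as $N$ grows.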
The square-root of the variance, $\sigma^2$, gives the standard deviation,
$\sigma=\sqrt{\sigma^2}$ | 3 |
The standard deviation divided by the mean provides the coefficient of variation (often expressed in percent),
$cv=\frac{\sigma}{\mu_y},\quad\text{or}\quad cv(\%)=\frac{\sigma}{\mu_y}\times 100\quad\text{(in percent)}$ | 4 |
In R:
sqrt(pvar(trees10$dbh))
## [1] 11.67
sqrt(pvar(trees10$dbh))/mean(trees10$dbh)
## [1] 0.4405
sqrt(pvar(trees10$dbh))/mean(trees10$dbh) * 100
## [1] 44.05
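Equations 3 and 4 can be wrapped in two small helper functions to avoid repeating the nested calls. The names psd() and pcv() are hypothetical: they are introduced here for illustration and are not part of base R or of this article's code. The vector y_demo is again made-up data.

```r
# Parametric variance (divides by N), as defined in this article
pvar <- function(x) sum((x - mean(x))^2) / length(x)

# Hypothetical helper: population standard deviation (equation 3)
psd <- function(x) sqrt(pvar(x))

# Hypothetical helper: coefficient of variation (equation 4),
# optionally expressed in percent
pcv <- function(x, percent = FALSE) {
  cv <- psd(x) / mean(x)
  if (percent) cv * 100 else cv
}

# Hypothetical DBH values (cm), NOT the trees10 data
y_demo <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39)

psd(y_demo)                  # standard deviation
pcv(y_demo)                  # coefficient of variation as a ratio
pcv(y_demo, percent = TRUE)  # the same value in percent
```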
Again, these values can only be calculated if we have access to all values $y_i$ in the population $U$. This is rarely the case. Usually we look at only a subset of $U$. This subset of elements will be called a sample. Throughout this document we will assume that some chance mechanism “decides” which elements end up in the sample. We talk about a random sample if sample selection is done at random.
The set of elements in a sample will be denoted $S$. The sample size, i.e., the number of elements in $S$, is given by $n$. In R we will take a simple random sample without replacement (SRSwoR) of size $n=2$ from $U$. Without replacement means that once an element from $U$ has been selected, it cannot be selected again.
S <- sample(x = trees10$dbh, size = 2)
S
## [1] 19 14
- What the function sample() does: The function sample(x, size) takes a random sample from x. The size argument defines the number of elements in the sample. x can either be a single numerical value, e.g., 10 (in which case the sample is drawn from 1:10), or a vector of values. If size is larger than the number of elements in x, use replace = TRUE (e.g., sample(c("A","B"), size = 10, replace = TRUE)).
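Because sample() draws at random, repeated calls usually return different elements. The sketch below shows how a draw can be made reproducible with set.seed() and how drawing without replacement behaves. The seed value 42 and the vector y_demo are arbitrary choices for illustration, not taken from this article's data.

```r
# Hypothetical DBH values (cm), NOT the trees10 data
y_demo <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39)

set.seed(42)                        # arbitrary seed, fixes the random draw
S1 <- sample(x = y_demo, size = 2)  # SRSwoR of size n = 2
set.seed(42)                        # same seed again
S2 <- sample(x = y_demo, size = 2)
identical(S1, S2)                   # same seed, same sample

# Without replacement no element is drawn twice: sampling all N elements
# returns a permutation of the population.
perm <- sample(y_demo, size = length(y_demo))
anyDuplicated(perm)                 # 0 means no duplicates

# Asking for more elements than the population holds is an error
# unless replace = TRUE is set.
too_many <- try(sample(y_demo, size = 11), silent = TRUE)
inherits(too_many, "try-error")
with_repl <- sample(y_demo, size = 11, replace = TRUE)
length(with_repl)
```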
The sample mean is defined as,
$\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i$ | 5 |
The difference between equation 1 and equation 5 is important. For the latter we look at only the values $y_i$ that are in the sample $S$. All other values are assumed unknown to us.
The notation we will use in the following is somewhat different from the one given in, for example, Kleinn (2013)[1]. To indicate that we look at all elements in the sample we write $i\in S$ (in words: all elements $i$ that are “element of” the sample $S$). The population and sample mean can, thus, be written as,
\[\mu_y=\frac{1}{N}\sum_{i\in U}y_i\quad\text{(population mean), and}\quad\bar{y}=\frac{1}{n}\sum_{i\in S}y_i\quad\text{(sample mean)}\]
For the population mean, $\mu_y$, this means that we sum the values $y_i$ of all elements of the population $U$, whereas for the sample mean we sum the values $y_i$ of all elements of $S$. This notation seems to be standard in the survey statistics literature, and it avoids confusion. Equation 5, for example, could be misread as a sum over the first $n$ elements $1,2,\ldots,n$ of $U$. However, the elements in the sample $S$ can be any $i$ from the population, not just the first $n$ elements.
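The index-set notation maps directly to R if we sample the labels $i$ rather than the values $y_i$ and then use the labels to subset. This is a sketch with made-up data: the names U_demo and S_labels and the seed value are introduced here for illustration only.

```r
# Hypothetical DBH values (cm), NOT the trees10 data
y_demo <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39)

U_demo <- seq_along(y_demo)           # population as labels 1, ..., N
set.seed(1)                           # arbitrary seed for reproducibility
S_labels <- sample(U_demo, size = 2)  # the sample S as a set of labels

S_labels                              # in general not simply the labels 1, 2

# Sum over i in U (population mean) and over i in S (sample mean)
mu_y  <- sum(y_demo[U_demo]) / length(U_demo)
y_bar <- sum(y_demo[S_labels]) / length(S_labels)

mu_y
y_bar
```

Keeping the labels, rather than only the sampled values, makes it possible to trace each observation back to its element in the population.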
That said, the variance, standard deviation and coefficient of variation for a single sample are defined as
$s^2=\frac{\sum_{i\in S}(y_i-\bar{y})^2}{n-1}$ | 6 |
$s=\sqrt{s^2}$ | 7 |
$\widehat{cv}=\frac{s}{\bar{y}}$ | 8 |
Do not get confused: the capital $S$ denotes the set of sampled elements, whereas the lower-case $s$ denotes the standard deviation of the sampled values $y_i$, $i\in S$. In R:
mean(S)
## [1] 16.5
var(S)
## [1] 12.5
sd(S)
## [1] 3.536
sd(S)/mean(S)
## [1] 0.2143
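Equations 5 to 8 can also be written out by hand for the sample drawn above. The sketch hard-codes the two sampled values 19 and 14 in a vector named S_demo (a name introduced here) so that it runs without trees10, and compares the results with var() and sd().

```r
S_demo <- c(19, 14)  # the two sampled DBH values from the example above
n <- length(S_demo)  # sample size

y_bar  <- sum(S_demo) / n                   # equation 5: 16.5
s2     <- sum((S_demo - y_bar)^2) / (n - 1) # equation 6: 12.5
s      <- sqrt(s2)                          # equation 7
cv_hat <- s / y_bar                         # equation 8

# The built-in functions use the same n - 1 denominator
all.equal(s2, var(S_demo))
all.equal(s, sd(S_demo))
```

With $n=2$ both deviations from the mean are $\pm 2.5$, so $s^2=(2.5^2+2.5^2)/(2-1)=12.5$.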
Parametric variance
# Function for the parametric variance
pvar <- function(x) {
  mx <- mean(x)
  vx <- sum((x - mx)^2)/length(x)
  return(vx)
}
References
- [1] Kleinn, C., 2013. Lecture Notes for the Teaching Module Forest Inventory.