Resource assessment exercises: mean, variance and standard deviation
This section is still under construction. This article was last modified on 05/24/2014.
- This article is part of the Resource assessment exercises. See the category page for a (chronological) table of contents.
Formally, the population (i.e., the ten trees) will be denoted by $U$ and consists of $N=10$ elements,
\[U=\{\text{tree}_1,\text{tree}_2,\ldots,\text{tree}_i,\ldots,\text{tree}_N\}\]
For simplicity we will let the $i$th tree be represented by its label $i$. Thus, our finite population can be written as
\[U=\{1,2,\ldots,i,\ldots,N\}\]
Attached to each tree is the value of a study variable \(y\), e.g., the DBH. The value of the study variable for the \(i\)th element will be denoted \(y_i\). The population mean of \(y\) is defined as
$\mu_y=\frac{1}{N}\sum_{i=1}^N y_i$ | 1 |
Here, our study (or target) variable is the DBH. We calculate the population mean of the variable dbh in R:
mean(trees10$dbh)
## [1] 26.5
- What the function mean() does: The function mean(x) computes the mean of a vector x (see equation 1). The $ operator in the last code example is used to access a single column of a data.frame.
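As a minimal sketch of equation 1, the mean can be written out by hand and compared with mean(). The data frame trees_demo and its DBH values are hypothetical stand-ins (the trees10 data are not reproduced here), so the result differs from the 26.5 shown above.

```r
# Hypothetical stand-in for trees10; these DBH values are made up for illustration
trees_demo <- data.frame(
  tree = 1:5,
  dbh  = c(12, 18, 25, 31, 40)
)

N <- nrow(trees_demo)                # population size
mu_hand <- sum(trees_demo$dbh) / N   # equation 1 written out
mu_r    <- mean(trees_demo$dbh)      # built-in equivalent

mu_hand
mu_r
```

Both expressions return the same value, because mean() is simply the sum of the values divided by their number.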
This is the population mean, or so-called parametric mean. We can only calculate a parameter when we conduct a census. That is, we have access to all values \(y_i\) in the population \(U\).
The variance, i.e., the average squared deviation of the individual values \(y_i\) from their mean, is defined as
$\sigma^2=\frac{\sum_{i=1}^N(y_i-\mu_y)^2}{N}$ | 2 |
In R:
pvar(trees10$dbh)
## [1] 136.2
- What the function pvar() does: The function pvar(x) computes the parametric variance of a vector x (see equation 2). Note that R's standard variance function var() uses $N-1$ in the denominator. Here we divide by $N$ because the population mean is known rather than estimated, i.e., it is not a random variable. The definition of the function pvar() is given in the section on the parametric variance at the end of this article.
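The relation between the two denominators can be checked directly. The following is a sketch only: dbh_demo is a hypothetical vector of made-up DBH values, and pvar() is redefined here so that the block is self-contained.

```r
# Parametric (population) variance as defined in this article: divide by N
pvar <- function(x) {
  sum((x - mean(x))^2) / length(x)
}

# Hypothetical DBH values, made up for illustration
dbh_demo <- c(12, 18, 25, 31, 40)
N <- length(dbh_demo)

v_pop    <- pvar(dbh_demo)   # denominator N
v_sample <- var(dbh_demo)    # denominator N - 1

# The two results differ only by the factor (N - 1) / N
all.equal(v_pop, v_sample * (N - 1) / N)
```

For large $N$ the factor $(N-1)/N$ is close to 1, so the distinction matters mostly for small populations such as the ten trees used here.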
The square root of the variance \(\sigma^2\) gives the standard deviation,
$\sigma=\sqrt{\sigma^2}$ | 3 |
The standard deviation divided by the mean provides the coefficient of variation (often expressed in percent),
$cv=\frac{\sigma}{\mu_y},\quad\text{or}\quad cv(\%)=\frac{\sigma}{\mu_y}\times 100\quad\text{(in percent)}$ | 4 |
In R:
sqrt(pvar(trees10$dbh))
## [1] 11.67
sqrt(pvar(trees10$dbh))/mean(trees10$dbh)
## [1] 0.4405
sqrt(pvar(trees10$dbh))/mean(trees10$dbh) * 100
## [1] 44.05
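One reason the coefficient of variation is useful is that it does not depend on the unit of measurement, whereas the standard deviation does. The sketch below illustrates this; the helper names psd() and pcv() and the DBH values are hypothetical and not part of the original exercise.

```r
# Parametric variance, redefined here so the block is self-contained
pvar <- function(x) sum((x - mean(x))^2) / length(x)

# Hypothetical helper names (not from the article)
psd <- function(x) sqrt(pvar(x))       # equation 3
pcv <- function(x) psd(x) / mean(x)    # equation 4

dbh_cm <- c(12, 18, 25, 31, 40)  # made-up DBH values in cm
dbh_mm <- dbh_cm * 10            # the same trees expressed in mm

psd(dbh_cm)
psd(dbh_mm)   # ten times larger than in cm
pcv(dbh_cm)
pcv(dbh_mm)   # unchanged by the change of unit
```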
Again, these values can only be calculated if we have access to all values $y_i$ in the population $U$. This is rarely the case. Usually we look at only a subset of $U$. This subset of elements will be called a sample. Throughout this document we will assume that some chance mechanism “decides” which elements end up in the sample. We talk about a random sample if sample selection is done at random.
The set of elements in a sample will be denoted $S$. The sample size, i.e., the number of elements in $S$, is given by $n$. In R we will take a simple random sample without replacement (SRSwoR) of size $n=2$ from $U$. Without replacement means that once an element of $U$ has been selected, it cannot be selected again.
S <- sample(x = trees10$dbh, size = 2)
S
## [1] 19 14
- What the function sample() does: The function sample(x, size) takes a sample from x. The size argument defines the number of elements in the sample. x can either be a single numerical value, e.g., 10 (in which case the sample is drawn from 1:10), or a vector of values. If size is larger than the number of elements in x, use replace = TRUE (e.g., sample(c("A","B"), size = 10, replace = TRUE)).
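The behaviour of sample() described above can be tried out as follows. This is a sketch under stated assumptions: dbh_demo holds made-up values, and set.seed() is used only to make the illustration reproducible; it is not part of the original exercise.

```r
set.seed(1)  # only to make this illustration reproducible

dbh_demo <- c(12, 18, 25, 31, 40)  # hypothetical DBH values

# SRSwoR: each element of dbh_demo can be drawn at most once
s_wor <- sample(x = dbh_demo, size = 2)

# A single number n is treated as 1:n, so this draws two labels from 1..5
labels <- sample(x = 5, size = 2)

# With replacement, the sample may be larger than the population
s_wr <- sample(c("A", "B"), size = 10, replace = TRUE)

# Without replacement, asking for more elements than available is an error
res <- try(sample(dbh_demo, size = 6), silent = TRUE)
inherits(res, "try-error")
```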
The sample mean is defined as,
$\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i$ | 5 |
The difference between equation 1 and equation 5 is important. For the latter we look at only the values $y_i$ that are in the sample $S$. All other values are assumed unknown to us.
The notation we will use in the following is somewhat different from the one given in, for example, . To indicate that we look at all elements in the sample we write $i\in S$ (in words: all elements $i$ that are “element of” the sample $S$). The population and sample mean can, thus, be written as,
$\mu_y=\frac{1}{N}\sum_{i\in U}y_i\quad\text{(population mean), and}\quad\bar{y}=\frac{1}{n}\sum_{i\in S}y_i\quad\text{(sample mean)}$ | 8 |
For the population mean, $\mu_y$, this means that we take the sum of all $y_i$ whose labels are elements of the population $U$, whereas for the sample mean we take the sum of the values $y_i$ of all elements of $S$. This notation seems to be standard in the survey statistics literature, and it avoids confusion. Looking at equation 5, for example, one could erroneously assume that we look at the first $1,2,\ldots$ elements of $U$ up to $n$. However, the elements in the sample $S$ can be any $i$ from the population, not just the first $n$ elements.
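The set notation maps naturally onto indexing in R if the sample is stored as a set of labels rather than as values. The following is a sketch only: the vector y contains made-up values, the name S_labels is hypothetical, and set.seed() is used purely for reproducibility.

```r
set.seed(42)  # for reproducibility of this illustration only

y <- c(12, 18, 25, 31, 40)  # hypothetical values y_i for U = {1, ..., N}
N <- length(y)
n <- 2

# The sample S as a set of labels i drawn from U (SRSwoR)
S_labels <- sample(N, size = n)

mu_y  <- sum(y) / N             # sum over all i in U
y_bar <- sum(y[S_labels]) / n   # sum over the i in S only

S_labels
y_bar
```

The labels in S_labels are generally not 1 and 2, which is exactly the point made above: the sample mean sums over whichever elements were selected, not over the first $n$ elements of $U$.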
That said, the variance, standard deviation and coefficient of variation for a single sample are defined as
$s^2=\frac{\sum_{i\in S}(y_i-\bar{y})^2}{n-1}$ | 9 |
$s=\sqrt{s^2}$ | 10 |
$\widehat{cv}=\frac{s}{\bar{y}}$ | 11 |
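Do not get confused: the capital \(S\) denotes the set of sampled elements, whereas the lower-case \(s\) denotes the standard deviation of the values \(y_i\), \(i\in S\). In R, the sample mean, variance, standard deviation and coefficient of variation of the sample drawn above are: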
## [1] 16.5
## [1] 12.5
## [1] 3.536
## [1] 0.2143
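The four values above are consistent with applying equations 5 and 9 to 11 to the two sampled values 19 and 14. The calls below are a reconstruction under that assumption, not necessarily the code originally used; the object names y_bar, s2, s and cv_hat are chosen here for illustration.

```r
S <- c(19, 14)  # the two sampled DBH values shown earlier

y_bar  <- mean(S)     # sample mean (equation 5)
s2     <- var(S)      # sample variance, denominator n - 1 (equation 9)
s      <- sd(S)       # sample standard deviation (equation 10)
cv_hat <- s / y_bar   # estimated coefficient of variation (equation 11)

round(c(mean = y_bar, var = s2, sd = s, cv = cv_hat), 4)
```

Note that var() and sd() are the appropriate functions here, because the mean is estimated from the sample and the denominator is therefore $n-1$.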
Parametric variance
# Function for the parametric variance
pvar <- function(x) {
  mx <- mean(x)
  vx <- sum((x - mx)^2)/length(x)
  return(vx)
}