Resource assessment exercises: mean, variance and standard deviation
This section is still under construction. This article was last modified on 05/17/2014.
- This article is part of the Resource assessment exercises. See the category page for a (chronological) table of contents.
Formally, the population (i.e., the ten trees) will be denoted by \(U\) and consists of \(N=10\) elements,
\[U=\{\text{tree}_1,\text{tree}_2,\ldots,\text{tree}_i,\ldots,\text{tree}_N\}.\]
For simplicity we will let the \(i\)th tree be represented by its label \(i\). Thus, our finite population can be written as
\[U=\{1,2,\ldots,i,\ldots,N\}.\]
Attached to each tree is the value of a study variable \(y\), e.g., the DBH. The value of the study variable for the \(i\)th element will be denoted \(y_i\). The population mean of \(y\) is defined as
\[\mu_y=\frac{1}{N}\sum_{i=1}^N y_i. \]
Here, our study (or target) variable is the DBH. We calculate the population mean of the variable dbh in R:
mean(trees10$dbh)
## [1] 26.5
- What the function mean() does: The function mean(x) computes the mean of a vector x (see equation above). The $ in the last code example is used to access a single column of a data.frame.
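To connect mean() with the formula for \(\mu_y\), the following sketch computes the mean by hand and compares it with the built-in function. The data frame demo and its values are made up for illustration; they are not the trees10 data.

```r
# Hypothetical data frame (NOT the trees10 data used in the text)
demo <- data.frame(id = 1:8, dbh = c(2, 4, 4, 4, 5, 5, 7, 9))

y <- demo$dbh              # `$` extracts the column dbh as a numeric vector
N <- length(y)             # population size

mu_by_hand <- sum(y) / N   # (1/N) times the sum of all y_i
mu_builtin <- mean(y)      # the same value via the built-in function

mu_by_hand
mu_builtin
```

Both expressions should return the same number, since mean() implements exactly this sum divided by the number of elements.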
This is the population mean, or so-called parametric mean. We can only calculate a parameter when we conduct a census. That is, we have access to all values \(y_i\) in the population \(U\).
The variance, i.e., the average squared deviation of the individual values \(y_i\) from their mean, is defined as
\[\sigma^2=\frac{\sum_{i=1}^N(y_i-\mu_y)^2}{N}. \]
In R:
## [1] 136.2
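A minimal sketch of how a population variance might be computed in R follows. Note that the built-in var() divides by \(n-1\) rather than \(N\), so it does not return \(\sigma^2\) directly. The helper pop_var is a hypothetical name introduced here, and the vector y is made up; it is not the trees10 data.

```r
# Hypothetical helper: population variance with divisor N
pop_var <- function(y) {
  sum((y - mean(y))^2) / length(y)
}

y <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values, not the trees10 data
N <- length(y)

sigma2 <- pop_var(y)
sigma2

# var() uses the divisor N - 1; rescaling recovers the population variance
sigma2_rescaled <- var(y) * (N - 1) / N
sigma2_rescaled
```

The two results should agree, which shows that the only difference between var() and the population formula is the divisor.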
The square-root of the variance, \(\sigma^2\), gives the standard deviation,
\[\sigma=\sqrt{\sigma^2}. \]
The standard deviation divided by the mean provides the coefficient of variation (often expressed in percent),
\[cv=\frac{\sigma}{\mu_y},\quad\text{or}\quad cv(\%)=\frac{\sigma}{\mu_y}\times 100\quad\text{(in percent)}. \]
In R:
## [1] 11.67
## [1] 0.4405
## [1] 44.05
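The three values above are the standard deviation, the coefficient of variation, and the coefficient of variation in percent. A sketch of the corresponding steps, using made-up values rather than the trees10 data, could look as follows. Like var(), the built-in sd() uses the divisor \(n-1\), so the population standard deviation is computed from the population variance here.

```r
# Made-up values for illustration, not the trees10 data
y <- c(2, 4, 4, 4, 5, 5, 7, 9)

sigma2 <- sum((y - mean(y))^2) / length(y)  # population variance (divisor N)
sigma  <- sqrt(sigma2)                      # population standard deviation
cv     <- sigma / mean(y)                   # coefficient of variation
cv_pct <- cv * 100                          # the same, in percent

sigma
cv
cv_pct
```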
Again, these values can only be calculated if we have access to all values \(y_i\) in the population \(U\). This is rarely the case. Usually we look at only a subset of \(U\). This subset of elements will be called a sample. Throughout this document we will assume that some chance mechanism “decides” which elements end up in the sample. We talk about a random sample if sample selection is done at random.
The set of elements in a sample will be denoted \(S\). The sample size, i.e., the number of elements in \(S\), is given by \(n\). In R we will take a simple random sample without replacement (SRSwoR) of size \(n=2\) from \(U\). Without replacement means that once an element from \(U\) has been selected, it cannot be selected again.
## [1] 19 14
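One way to draw such a sample is the base R function sample(), which samples without replacement by default. The sketch below draws labels rather than values, so that the sample \(S\) is a set of elements of \(U\). The population values and the seed are made up for illustration and will not reproduce the output above.

```r
# Made-up population values, not the trees10 data
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(y)   # population size
n <- 2           # sample size

set.seed(1)      # arbitrary seed, only to make the draw repeatable

# Draw n labels from 1..N; replace = FALSE means no label can occur twice
S <- sample(seq_len(N), size = n, replace = FALSE)
S

y[S]             # values of the study variable for the sampled elements
```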
The sample mean is defined as,
\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i. \]
The difference between the population mean \(\mu_y\) and the sample mean \(\bar{y}\) is important. For the latter we look at only the values \(y_i\) that are in the sample \(S\). All other values are assumed unknown to us.
The notation we will use in the following is somewhat different from the one found in other texts. To indicate that we look at all elements in the sample we write \(i\in S\) (in words: all elements \(i\) that are “element of” the sample \(S\)). The population and sample mean can, thus, be written as
\[\mu_y=\frac{1}{N}\sum_{i\in U}y_i\quad\text{(population mean), and}\quad\bar{y}=\frac{1}{n}\sum_{i\in S}y_i\quad\text{(sample mean)}. \]
For the population mean, \(\mu_y\), this means that we take the sum of all \(y_i\) that are elements of the population \(U\), whereas for the sample mean we take the sum of the values \(y_i\) of all elements of \(S\). This notation seems to be standard in the survey statistics literature, and it avoids confusion. If we look at the sample mean written as \(\frac{1}{n}\sum_{i=1}^n y_i\), for example, one could erroneously assume that we look at the first \(1,2,\ldots\) elements of \(U\) up to \(n\). However, the elements in the sample \(S\) can be any \(i\) from the population, not just the first \(n\) elements.
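The index-set notation maps directly onto vector indexing in R. In the following sketch the values in y and the sample labels in S are made up; the point is only that summing over \(i\in S\) is not the same as summing over the first \(n\) elements.

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values for the labels 1..8
U <- seq_along(y)                # population labels 1, 2, ..., N
S <- c(6, 3)                     # hypothetical sample labels, not the first n
n <- length(S)

mu_y  <- sum(y[U]) / length(U)   # sum over all i in U
y_bar <- sum(y[S]) / n           # sum over all i in S

# The erroneous reading: averaging the first n elements of U
first_n_mean <- sum(y[1:n]) / n

mu_y
y_bar
first_n_mean
```

With these values the sample mean over \(S\) differs from the mean of the first \(n\) elements, which is the confusion the set notation is meant to prevent.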
That said, the variance, standard deviation and coefficient of variation for a single sample are defined as
\[s^2=\frac{\sum_{i\in S}(y_i-\bar{y})^2}{n-1}, \]
\[s=\sqrt{s^2}, \]
and
\[\hat{cv}=\frac{s}{\bar{y}}. \]
Do not get confused: the capital \(S\) denotes the set of sampled elements, whereas the lower-case \(s\) denotes the standard deviation of the values \(y_i\), \(i\in S\). In R:
## [1] 16.5
## [1] 12.5
## [1] 3.536
## [1] 0.2143
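These four numbers can be checked from the two sampled values, 19 and 14. The sketch below assumes only those two values. For a sample, the built-in var() and sd() are appropriate because they use the divisor \(n-1\). The variable names are chosen here for illustration.

```r
y_s <- c(19, 14)        # the two sampled values reported above

y_bar  <- mean(y_s)     # sample mean
s2     <- var(y_s)      # sample variance, divisor n - 1
s      <- sd(y_s)       # sample standard deviation, sqrt(s2)
cv_hat <- s / y_bar     # estimated coefficient of variation

# Round to four significant digits for comparison with the output above
signif(c(y_bar, s2, s, cv_hat), 4)
```

Rounded to four significant digits, the results match 16.5, 12.5, 3.536 and 0.2143.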