Resource assessment exercises: mean, variance and standard deviation

From AWF-Wiki
Revision as of 08:59, 24 May 2014 by Lburgr

This article is part of the Resource assessment exercises. See the category page for a (chronological) table of contents.

Formally, the population (i.e., the ten trees) will be denoted by $U$ and consists of $N=10$ elements,

\[U=\{\text{tree}_1,\text{tree}_2,\ldots,\text{tree}_i,\ldots,\text{tree}_N\}\]


For simplicity we will let the $i$th tree be represented by its label $i$. Thus, our finite population can be written as

\[U=\{1,2,\ldots,i,\ldots,N\}\]

Attached to each tree is the value of a study variable $y$, e.g., the DBH. The value of the study variable for the $i$th element will be denoted $y_i$. The population mean of $y$ is defined as


\[\mu_y=\frac{1}{N}\sum_{i=1}^{N}y_i\qquad(1)\]


Here, our study (or target) variable is the DBH. We calculate the population mean of the variable dbh in R:

mean(trees10$dbh)

## [1] 26.5


What the function mean() does
The function mean(x) computes the arithmetic mean of a numeric vector x (see equation 1). The $ in the last code example is used to access a single column of a data.frame.
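As a minimal sketch of equation 1, the mean can also be written out by hand and compared with mean(). The vector dbh_demo and the data.frame demo_df below are hypothetical illustration data, not the trees10 data set used in this article.

```r
# Hypothetical DBH values (cm) for illustration only; not the trees10 data
dbh_demo <- c(12, 18, 25, 31, 40)

N <- length(dbh_demo)            # population size
mu_manual <- sum(dbh_demo) / N   # equation 1 written out

mu_manual
mean(dbh_demo)                   # same value as mu_manual

# The $ operator extracts one column of a data.frame as a vector
demo_df <- data.frame(id = seq_len(N), dbh = dbh_demo)
mu_column <- mean(demo_df$dbh)
mu_column
```

All three results agree, because mean() is simply the sum of the values divided by their number.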

This is the population mean, or so-called parametric mean. We can only calculate a parameter when we conduct a census, that is, when we have access to all values $y_i$ in the population $U$.

The variance, i.e., the average squared deviation of the individual values $y_i$ from their mean, is defined as


\[\sigma^2=\frac{\sum_{i=1}^N(y_i-\mu_y)^2}{N}\qquad(2)\]


In R:

pvar(trees10$dbh)

## [1] 136.2


What the function pvar() does
The function pvar(x) computes the parametric variance of a vector x (see equation 2). Note that R's standard variance function var() uses $n-1$ in the denominator. Here, we divide by $N$ because the mean $\mu_y$ is known from the census rather than estimated, i.e., it is not a random variable. The definition of the function pvar() is given in the section Parametric variance.
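The relation between the two denominators can be checked directly. The following is a small sketch using hypothetical values (the vector y is made up for illustration); pvar() is repeated from the section Parametric variance so that the snippet runs on its own.

```r
# Parametric variance as defined in this article (divides by N)
pvar <- function(x) {
  sum((x - mean(x))^2) / length(x)
}

# Hypothetical values for illustration only
y <- c(12, 18, 25, 31, 40)
N <- length(y)

pvar(y)                            # divides by N
var(y)                             # divides by N - 1, hence larger
rescaled <- var(y) * (N - 1) / N   # rescaling var() recovers pvar(y)
rescaled
```

The difference between the two versions shrinks as the number of values grows, since $(N-1)/N$ approaches 1.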

The square root of the variance, $\sigma^2$, gives the standard deviation,


\[\sigma=\sqrt{\sigma^2}\qquad(3)\]


The standard deviation divided by the mean provides the coefficient of variation (often expressed in percent),


\[cv=\frac{\sigma}{\mu_y},\quad\text{or}\quad cv(\%)=\frac{\sigma}{\mu_y}\times 100\quad\text{(in percent)}\qquad(4)\]


In R:

sqrt(pvar(trees10$dbh))

## [1] 11.67

sqrt(pvar(trees10$dbh))/mean(trees10$dbh)

## [1] 0.4405

sqrt(pvar(trees10$dbh))/mean(trees10$dbh) * 100
## [1] 44.05
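To avoid retyping these expressions, equations 3 and 4 could be wrapped in small helper functions. The names psd() and pcv() are hypothetical and not part of R or of this exercise series, and the vector y is made-up illustration data. This is only a sketch.

```r
# Parametric variance as defined in this article (divides by N)
pvar <- function(x) sum((x - mean(x))^2) / length(x)

# Hypothetical helper: population standard deviation (equation 3)
psd <- function(x) sqrt(pvar(x))

# Hypothetical helper: coefficient of variation (equation 4),
# optionally expressed in percent
pcv <- function(x, percent = FALSE) {
  cv <- psd(x) / mean(x)
  if (percent) cv * 100 else cv
}

# Hypothetical values for illustration only
y <- c(12, 18, 25, 31, 40)

psd(y)
pcv(y)
pcv(y, percent = TRUE)
```

Because the standard deviation and the mean carry the same unit, the coefficient of variation is unit-free, which makes it convenient for comparing the variability of different variables.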

Again, these values can only be calculated if we have access to all values $y_i$ in the population $U$. This is rarely the case. Usually we look at only a subset of $U$. This subset of elements will be called a sample. Throughout this document we will assume that some chance mechanism “decides” which elements end up in the sample. We talk about a random sample if sample selection is done at random.

The set of elements in a sample will be denoted $S$. The sample size, i.e., the number of elements in $S$, is given by $n$. In R we will take a simple random sample without replacement (SRSwoR) of size $n=2$ from $U$. Without replacement means that once an element from $U$ has been selected, it cannot be selected again.

S <- sample(x = trees10$dbh, size = 2)
S

## [1] 19 14


What the function sample() does
The function sample(x, size) takes a random sample from x. The size argument defines the number of elements in the sample. x can either be a single numerical value, e.g., 10, in which case the sample is drawn from the integers 1 to 10, or a vector of values. If size is larger than the number of elements in x, use replace = TRUE, e.g., sample(c("A","B"), size = 10, replace = TRUE).
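The different ways of calling sample() can be tried out in a short sketch. The vector y holds hypothetical values, and the seed passed to set.seed() is arbitrary; it only makes the random draws repeatable.

```r
set.seed(42)  # arbitrary seed so the draws can be repeated

# Hypothetical values for illustration only
y <- c(12, 18, 25, 31, 40)

s1 <- sample(x = y, size = 2)    # SRSwoR: two distinct elements of y
s2 <- sample(x = 10, size = 3)   # single number: draws from 1:10
s3 <- sample(c("A", "B"), size = 10, replace = TRUE)  # with replacement

# Without replacement, asking for more elements than x contains is an error
res <- tryCatch(sample(y, size = 6), error = function(e) "error")

s1
s2
s3
res
```

Note that a numeric x of length one is always interpreted as the upper end of the range 1 to x, so sample() should not be applied directly to a vector that might shrink to a single number.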

The sample mean is defined as,


\[\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i\qquad(5)\]


The difference between equation 1 and equation 5 is important. For the latter we look at only the values $y_i$ that are in the sample $S$. All other values are assumed unknown to us.

The notation we will use in the following is somewhat different from the one given in, for example, Kleinn (2013)[1]. To indicate that we look at all elements in the sample we write $i\in S$ (in words: all elements $i$ that are “element of” the sample $S$). The population and sample mean can, thus, be written as,

\[\mu_y=\frac{1}{N}\sum_{i\in U}y_i\quad\text{(population mean), and}\quad\bar{y}=\frac{1}{n}\sum_{i\in S}y_i\quad\text{(sample mean)}\]

For the population mean, $\mu_y$, this means that we take the sum of the values $y_i$ of all elements of the population $U$, whereas for the sample mean we take the sum of the values $y_i$ of all elements of $S$. This notation seems to be standard in the survey statistics literature, and it avoids confusion. Looking at equation 5, for example, one could erroneously assume that we look at the first $n$ elements $1,2,\ldots,n$ of $U$. However, the elements in the sample $S$ can be any $i$ from the population, not just the first $n$ elements.
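The set notation translates naturally into R if the sample is stored as a set of labels rather than values. In the sketch below, y is a hypothetical vector of values and the seed is arbitrary. Unlike the sample() call shown earlier in this article, S here holds the labels $i$ of the sampled elements, and the values are looked up with y[S].

```r
set.seed(1)  # arbitrary seed so the draw can be repeated

# Hypothetical values y_i for a population of N = 5 elements
y <- c(12, 18, 25, 31, 40)
N <- length(y)
U <- seq_len(N)            # population labels 1, 2, ..., N

n <- 2
S <- sample(U, size = n)   # labels of the sampled elements (SRSwoR)

mu_y  <- sum(y[U]) / N     # sum over all i in U (population mean)
y_bar <- sum(y[S]) / n     # sum over all i in S (sample mean)

S
mu_y
y_bar
```

The labels in S need not be the first $n$ labels of $U$, which is exactly the point made about equation 5.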

That said, the variance, standard deviation and coefficient of variation for a single sample are defined as


\[s^2=\frac{\sum_{i\in S}(y_i-\bar{y})^2}{n-1}\qquad(6)\]

\[s=\sqrt{s^2}\qquad(7)\]

\[\widehat{cv}=\frac{s}{\bar{y}}\qquad(8)\]


Do not get confused: the capital $S$ denotes the set of sampled elements, whereas the lower-case $s$ denotes the standard deviation of the sampled values $y_i$, $i\in S$. In R:

mean(S)
## [1] 16.5

var(S)
## [1] 12.5

sd(S)
## [1] 3.536

sd(S)/mean(S)
## [1] 0.2143


What the function sd() does
The function sd(x) computes the sample standard deviation of a vector x, i.e., the square root of var(x), which uses $n-1$ in the denominator.
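As a check, equations 6 to 8 can be written out by hand for the two sampled DBH values shown above (19 and 14) and compared with the built-in functions. The variable names y_bar, s2, s and cv_hat are chosen here for illustration.

```r
S <- c(19, 14)   # the two sampled DBH values from the example above
n <- length(S)

y_bar  <- sum(S) / n                     # sample mean
s2     <- sum((S - y_bar)^2) / (n - 1)   # equation 6
s      <- sqrt(s2)                       # equation 7
cv_hat <- s / y_bar                      # equation 8

s2       # 12.5, same as var(S)
s        # same as sd(S)
cv_hat   # same as sd(S) / mean(S)
```

With only two observations, both deviations from the mean are 2.5 in absolute value, so the sum of squares is 12.5 and dividing by $n-1=1$ leaves it unchanged.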

Parametric variance

# Function for the parametric variance
pvar <- function(x) {
	mx <- mean(x)
	vx <- sum((x - mx)^2)/length(x)
	return(vx)
}

References

  1. Kleinn, C., 2013. Lecture Notes for the Teaching Module Forest Inventory.