Resource assessment exercises: mean, variance and standard deviation

From AWF-Wiki
Revision as of 07:27, 24 May 2014 by Lburgr (Talk | contribs)


This section is still under construction! This article was last modified on 05/24/2014. If you have comments please use the Discussion page or contribute to the article!

This article is part of the Resource assessment exercises. See the category page for a (chronological) table of contents.

Formally, the population (i.e., the ten trees) will be denoted by $U$ and consists of $N=10$ elements,

\(U=\{\text{tree}_1,\text{tree}_2,\ldots,\text{tree}_i,\ldots,\text{tree}_N\}\)


For simplicity we will let the $i$th tree be represented by its label $i$. Thus, our finite population can be written as

\(U=\{1,2,\ldots,i,\ldots,N\}\)

Attached to each tree is the value of a study variable \(y\), e.g., the DBH. The value of the study variable for the \(i\)th element will be denoted \(y_i\). The population mean of \(y\) is defined as


$\mu_y=\frac{1}{N}\sum_{i=1}^N y_i$ (1)


Here, our study variable (or target variable) is the DBH. We calculate the population mean of the variable dbh in R:

mean(trees10$dbh)

## [1] 26.5


What the function mean() does
The function mean(x) computes the mean of a vector x (see equation above). The $ in the last code example is used to access a single column of a data.frame.
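As a quick check of equation 1, the following sketch computes the mean by hand and compares it with mean(). The vector y_demo is hypothetical illustration data, not the trees10 data set, and the variable names are assumptions made for this example.

```r
# Hypothetical DBH values for a small population; NOT the trees10 data
y_demo <- c(12, 15, 20, 23, 30)

N <- length(y_demo)            # population size
mu_manual <- sum(y_demo) / N   # equation 1 written out
mu_builtin <- mean(y_demo)     # built-in equivalent

mu_manual
mu_builtin
```

Both expressions return the same value, since mean() implements exactly the sum divided by the number of elements.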

This is the population mean, or so-called parametric mean. We can only calculate a parameter when we conduct a census, that is, when we have access to all values \(y_i\) in the population \(U\).

The variance, i.e., the average squared deviation of the individual values \(y_i\) from their mean, is defined as


$\sigma^2=\frac{\sum_{i=1}^N(y_i-\mu_y)^2}{N}$ (2)


In R:

pvar(trees10$dbh)

## [1] 136.2


What the function pvar() does
The function pvar(x) computes the parametric variance of a vector x (see previous equation). Note that R's standard variance function var() uses $N-1$ in the denominator. Here, we divide by $N$ because the mean is not estimated, i.e., it is not a random variable. The definition of the function pvar() is given in the section on the parametric variance at the end of this article.
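The relationship between pvar() and var() can be made explicit with a small sketch. The pvar() definition is repeated from the end of this article so that the block is self-contained; y_demo is hypothetical data, not the trees10 data set.

```r
# Parametric (population) variance, as defined at the end of this article
pvar <- function(x) {
  mx <- mean(x)
  sum((x - mx)^2) / length(x)
}

# Hypothetical values; NOT the trees10 data
y_demo <- c(12, 15, 20, 23, 30)
N <- length(y_demo)

v_pop <- pvar(y_demo)                     # divides by N
v_smp <- var(y_demo)                      # divides by N - 1
v_rescaled <- var(y_demo) * (N - 1) / N   # rescaled var(), equals pvar()

v_pop
v_smp
v_rescaled
```

Multiplying var() by $(N-1)/N$ recovers the parametric variance, so the two functions differ only in the denominator.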

The square root of the variance, \(\sigma^2\), gives the standard deviation,


$\sigma=\sqrt{\sigma^2}$ (3)


The standard deviation divided by the mean provides the coefficient of variation (often expressed in percent),


$cv=\frac{\sigma}{\mu_y},\quad\text{or}\quad cv(\%)=\frac{\sigma}{\mu_y}\times 100\quad\text{(in percent)}$ (4)


In R:

sqrt(pvar(trees10$dbh))

## [1] 11.67

sqrt(pvar(trees10$dbh))/mean(trees10$dbh)

## [1] 0.4405

sqrt(pvar(trees10$dbh))/mean(trees10$dbh) * 100
## [1] 44.05

Again, these values can only be calculated if we have access to all values $y_i$ in the population $U$. This is rarely the case. Usually we look at only a subset of $U$. This subset of elements will be called a sample. Throughout this document we will assume that some chance mechanism “decides” which elements end up in the sample. We talk about a random sample if sample selection is done at random.

The set of elements in a sample will be denoted $S$. The sample size, i.e., the number of elements in $S$, is given by $n$. In R we will take a simple random sample without replacement (SRSwoR) of size $n=2$ from $U$. Without replacement means that once an element from $U$ has been selected, it cannot be selected again.

S <- sample(x = trees10$dbh, size = 2)
S

## [1] 19 14


What the function sample() does
The function sample(x, size) takes a sample from x. The size argument defines the number of elements in the sample. x can either be a single numerical value, e.g., 10 (in which case the sample is drawn from 1:10), or a vector of values. If size is larger than the number of elements in x, use replace = TRUE (e.g., sample(c("A","B"), size = 10, replace = TRUE)).
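Because sample() draws at random, repeated calls give different results. The sketch below is an illustration under assumptions: it uses set.seed() with an arbitrary seed (not part of the original example) to make a draw repeatable, and it checks that sampling without replacement never returns the same label twice.

```r
# Hypothetical population of N = 10 element labels
N <- 10
n <- 2

set.seed(42)                # arbitrary seed, only to make the draw repeatable
S1 <- sample(N, size = n)   # SRSwoR from the labels 1, ..., N

set.seed(42)
S2 <- sample(N, size = n)   # same seed, so the same sample is drawn

identical(S1, S2)           # both draws agree
anyDuplicated(S1) == 0      # no label is drawn twice without replacement

# With replacement, size may exceed the number of elements in x
S3 <- sample(c("A", "B"), size = 10, replace = TRUE)
length(S3)
```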

The sample mean is defined as,


$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i$ (5)


The difference between equation 1 and equation 5 is important. For the latter we look at only the values $y_i$ that are in the sample $S$. All other values are assumed unknown to us.

The notation we will use in the following is somewhat different from the one given in, for example, . To indicate that we look at all elements in the sample we write $i\in S$ (in words: all elements $i$ that are “element of” the sample $S$). The population and sample mean can, thus, be written as,


$\mu_y=\frac{1}{N}\sum_{i\in U}y_i\quad\text{(population mean), and}\quad\bar{y}=\frac{1}{n}\sum_{i\in S}y_i\quad\text{(sample mean)}$ (8)


For the population mean, $\mu_y$, this means that we take the sum of all values $y_i$ whose labels are elements of the population $U$, whereas for the sample mean we take the sum of the values $y_i$ of all elements of $S$. This notation seems to be standard in the survey statistics literature, and it avoids confusion. If we look at equation 5, for example, one could erroneously assume that we look at the first $1,2,\ldots$ elements of $U$ up to $n$. However, the elements in the sample $S$ can be any $i$ from the population, not just the first $n$ elements.
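The set notation maps directly onto index vectors in R. The following sketch assumes a hypothetical study variable y (not the trees10 data) and samples the labels $i$ rather than the values, so that y[S] picks out exactly the values $y_i$ with $i\in S$. The seed is arbitrary and only makes the example repeatable.

```r
# Hypothetical study variable for a population U = {1, ..., N}; NOT trees10
y <- c(12, 15, 20, 23, 30)
N <- length(y)
n <- 2

U <- seq_len(N)             # population labels 1, ..., N

set.seed(1)                 # arbitrary seed for a repeatable draw
S <- sample(U, size = n)    # sampled labels, any i from U

mu_y  <- sum(y[U]) / N      # sum over i in U (population mean)
y_bar <- sum(y[S]) / n      # sum over i in S (sample mean)

S
mu_y
y_bar
```

Storing labels instead of values keeps the link between each sampled element and its position in the population, which is what the $i\in S$ notation expresses.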

Having said that, the variance, standard deviation and coefficient of variation for a single sample are defined as


$s^2=\frac{\sum_{i\in S}(y_i-\bar{y})^2}{n-1}$ (9)


$s=\sqrt{s^2}$ (10)


$\widehat{cv}=\frac{s}{\bar{y}}$ (11)


Do not get confused. The capital \(S\) denotes the set of sampled elements, whereas the lower-case \(s\) denotes the standard deviation of the values \(y_{i\in S}\). In R:

## [1] 16.5
## [1] 12.5
## [1] 3.536
## [1] 0.2143
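The calls that produced these four values are not shown. A minimal sketch that reproduces them, assuming S holds the two sampled DBH values 19 and 14 from the draw above and that output is rounded to four significant digits:

```r
# Sampled DBH values taken from the sample() output shown earlier
S <- c(19, 14)

y_bar  <- mean(S)     # sample mean
s2     <- var(S)      # sample variance, n - 1 in the denominator
s      <- sd(S)       # sample standard deviation, sqrt(var(S))
cv_hat <- s / y_bar   # estimated coefficient of variation

signif(c(y_bar, s2, s, cv_hat), 4)
```

Note that var() and sd() are appropriate here, in contrast to pvar(), because the sample mean is itself an estimate.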

Parametric variance

# Function for the parametric variance
pvar <- function(x) {
	mx <- mean(x)
	vx <- sum((x - mx)^2)/length(x)
	return(vx)
}