Resource assessment exercises: mean, variance and standard deviation

This section is still under construction! This article was last modified on 05/24/2014. If you have comments please use the Discussion page or contribute to the article!

This article is part of the Resource assessment exercises. See the category page for a (chronological) table of contents.

Formally, the population (i.e., the ten trees) will be denoted by $U$ and consists of $N=10$ elements,

$U=\{\text{tree}_1,\text{tree}_2,\ldots,\text{tree}_i,\ldots,\text{tree}_N\}$

For simplicity we will let the $i$th tree be represented by its label $i$. Thus, our finite population can be written as

$U=\{1,2,\ldots,i,\ldots,N\}$

Attached to each tree is the value of a study variable $y$, e.g., the DBH. The value of the study variable for the $i$th element will be denoted $y_i$. The population mean of $y$ is defined as

$\mu_y=\frac{1}{N}\sum_{i=1}^N y_i.$

1

Here, our study (or target) variable is the DBH. We calculate the population mean of the variable dbh in R:

mean(trees10$dbh)

## [1] 26.5

What the function mean() does: The function mean(x) computes the mean of a vector x (see equation above). The $ in the last code example is used to access a single column of a data.frame.

This is the population mean, also called the parametric mean. A parameter can only be calculated when we conduct a census, that is, when we have access to all values $y_i$ in the population $U$.
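As a small sketch of what the census calculation involves, the following writes out equation 1 by hand and compares it with mean(). The data frame trees_demo and its DBH values are hypothetical and chosen for illustration only; they are not the values stored in trees10.

```r
# Hypothetical DBH values for a population of N = 10 trees
# (illustrative only; these are NOT the values in trees10)
trees_demo <- data.frame(dbh = c(10, 12, 15, 18, 20, 22, 25, 28, 30, 40))

N <- nrow(trees_demo)                  # population size
mu_manual <- sum(trees_demo$dbh) / N   # equation 1 written out
mu_builtin <- mean(trees_demo$dbh)     # the same quantity via mean()

mu_manual
mu_builtin
```

Both lines return the same number, because mean() applied to the full population is exactly the sum of all $y_i$ divided by $N$.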

The variance, i.e., the average squared deviation of the individual values $y_i$ from their mean, is defined as

$\sigma^2=\frac{\sum_{i=1}^N(y_i-\mu_y)^2}{N}$

2

In R:

pvar(trees10$dbh)

## [1] 136.2

What the function pvar() does: The function pvar(x) computes the parametric variance of a vector x (see previous equation). Note that R's standard variance function var() uses $N-1$ in the denominator. Here, we divide by $N$ because the mean is not estimated, i.e., it is not a random variable. The definition of the function pvar() is given under Parametric variance below.
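The relation between the two denominators can be checked directly: multiplying the result of var() by $(N-1)/N$ should reproduce the parametric variance. The sketch below assumes a hypothetical vector y of ten DBH values (not the trees10 data) and repeats the definition of pvar() so that it runs on its own.

```r
# Hypothetical population values (illustrative only, not trees10$dbh)
y <- c(10, 12, 15, 18, 20, 22, 25, 28, 30, 40)
N <- length(y)

# Parametric variance: divide by N
pvar <- function(x) {
  sum((x - mean(x))^2) / length(x)
}

sigma2 <- pvar(y)                  # denominator N
rescaled <- var(y) * (N - 1) / N   # var() uses N - 1, so rescale it

all.equal(sigma2, rescaled)
```

For large $N$ the two versions differ very little; for a population of only ten elements the difference is noticeable.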

The square-root of the variance, $\sigma^2$, gives the standard deviation,

$\sigma=\sqrt{\sigma^2}$

3

The standard deviation divided by the mean provides the coefficient of variation (often expressed in percent),

$cv=\frac{\sigma}{\mu_y},\quad\text{or}\quad cv(\%)=\frac{\sigma}{\mu_y}\times 100\quad\text{(in percent)}$

4

In R:

sqrt(pvar(trees10$dbh))

## [1] 11.67

sqrt(pvar(trees10$dbh))/mean(trees10$dbh)

## [1] 0.4405

sqrt(pvar(trees10$dbh))/mean(trees10$dbh) * 100

## [1] 44.05
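If these parameters are needed repeatedly, the three calculations can be bundled in one helper. This is only a convenience sketch: the function name pop_summary() and the vector y are hypothetical and not part of the exercises, and pvar() is repeated so the block runs on its own.

```r
# Parametric variance, as defined in the article
pvar <- function(x) sum((x - mean(x))^2) / length(x)

# Hypothetical helper: population mean, standard deviation and
# coefficient of variation (as a ratio and in percent)
pop_summary <- function(x) {
  mu <- mean(x)
  sigma <- sqrt(pvar(x))
  c(mean = mu, sd = sigma, cv = sigma / mu, cv_percent = 100 * sigma / mu)
}

# Hypothetical population values (illustrative only, not trees10$dbh)
y <- c(10, 12, 15, 18, 20, 22, 25, 28, 30, 40)
res <- pop_summary(y)
round(res, 3)
```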

Again, these values can only be calculated if we have access to all values $y_i$ in the population $U$. This is rarely the case. Usually we look at only a subset of $U$. This subset of elements will be called a sample. Throughout this document we will assume that some chance mechanism “decides” which elements end up in the sample. We talk about a random sample if sample selection is done at random.

The set of elements in a sample will be denoted $S$. The sample size, i.e., the number of elements in $S$, is given by $n$. In R we will take a simple random sample without replacement (SRSwoR) of size $n=2$ from $U$. Without replacement means that once an element from $U$ has been selected, it cannot be selected again.

S <- sample(x = trees10$dbh, size = 2)
S

## [1] 19 14

What the function sample() does: The function sample(x, size) takes a sample from x. The size argument defines the number of elements in the sample. x can either be a single numerical value, e.g., 10, in which case the sample is drawn from 1:10, or a vector of values. If size is larger than the number of elements in x, use replace = TRUE (e.g., sample(c("A","B"), size = 10, replace = TRUE)).
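The different ways of calling sample() can be tried out as follows. The seed value and the small input vectors are arbitrary choices made for this sketch; set.seed() is only used so that repeated runs give the same draws.

```r
set.seed(1)  # arbitrary seed, only so that the draws are reproducible

# A single number n as x: sample from the labels 1:n
labels <- sample(10, size = 3)

# A vector as x: sample from its values (without replacement by default)
values <- sample(c(10, 12, 15, 18), size = 2)

# size larger than length(x) fails without replacement ...
too_many <- try(sample(c("A", "B"), size = 10), silent = TRUE)
inherits(too_many, "try-error")

# ... but works with replacement
with_repl <- sample(c("A", "B"), size = 10, replace = TRUE)
length(with_repl)
```

Without replacement no element can appear twice, so labels contains three distinct numbers between 1 and 10.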

The sample mean is defined as,

$\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i$

5

The difference between equation 1 and equation 5 is important. For the latter we look at only the values $y_i$ that are in the sample $S$. All other values are assumed unknown to us.

The notation we will use in the following is somewhat different from the one given in, for example, . To indicate that we look at all elements in the sample we write $i\in S$ (in words: all elements $i$ that are “element of” the sample $S$). The population and sample mean can, thus, be written as,

$\mu_y=\frac{1}{N}\sum_{i\in U}y_i\quad\text{(population mean), and}\quad\bar{y}=\frac{1}{n}\sum_{i\in S}y_i\quad\text{(sample mean)}$

8

For the population mean, $\mu_y$, this means that we take the sum of all $y_i$ whose labels are elements of the population $U$, whereas for the sample mean we take the sum of the values $y_i$ of all elements that are in $S$. This notation seems to be standard in the survey statistics literature, and it avoids confusion. If we look at equation 5, for example, one could erroneously assume that we look at the first $1,2,\ldots$ elements of $U$ up to $n$. However, the elements in the sample $S$ can be any $i$ from the population, not just the first $n$ elements.
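The set notation maps naturally onto indexing in R: one can draw the labels $i\in S$ first and then look up the corresponding values $y_i$. The following is a sketch under assumptions; the vector y, the seed and the object names are hypothetical and not taken from the exercises.

```r
# Hypothetical population values y_i, i = 1, ..., N (not trees10$dbh)
y <- c(10, 12, 15, 18, 20, 22, 25, 28, 30, 40)
N <- length(y)
n <- 2

set.seed(42)                      # arbitrary seed for reproducibility
S_labels <- sample(N, size = n)   # the set S: labels of the sampled trees
y_S <- y[S_labels]                # values y_i for all i in S

y_bar <- sum(y_S) / n             # sample mean: sum over i in S
first_n_mean <- mean(y[1:n])      # misreading: mean of the first n elements

S_labels
y_bar
first_n_mean
```

Here S_labels holds the labels and y_S the attached values. Only if the labels 1 and 2 happen to be drawn does y_bar coincide with first_n_mean.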

Having said that, the variance, standard deviation and coefficient of variation for a single sample are defined as

$s^2=\frac{\sum_{i\in S}(y_i-\bar{y})^2}{n-1}$

9

$s=\sqrt{s^2}$

10

$\widehat{cv}=\frac{s}{\bar{y}}$

11

Do not get confused. The capital $S$ denotes the set of sampled elements, whereas the lower-case $s$ denotes the standard deviation of the values $y_{i\in S}$. In R:

mean(S)

## [1] 16.5

var(S)

## [1] 12.5

sd(S)

## [1] 3.536

sd(S)/mean(S)

## [1] 0.2143
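The four values above can be reproduced by writing out the sample mean and equations 9 to 11 for the sample drawn earlier, which contained the values 19 and 14. The object names below are chosen for this sketch; the comparison with var() and sd() is included only as a check.

```r
S <- c(19, 14)   # values of the sample drawn above
n <- length(S)   # sample size

y_bar <- sum(S) / n                  # sample mean
s2 <- sum((S - y_bar)^2) / (n - 1)   # sample variance, equation 9
s <- sqrt(s2)                        # sample standard deviation, equation 10
cv_hat <- s / y_bar                  # estimated coefficient of variation, equation 11

round(c(y_bar, s2, s, cv_hat), 4)

# The built-in functions use the same n - 1 denominator
all.equal(s2, var(S))
all.equal(s, sd(S))
```

With $n=2$ both squared deviations equal $2.5^2=6.25$, so $s^2=12.5/1=12.5$, in agreement with the output shown above.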

Parametric variance

# Function for the parametric variance
pvar <- function(x) {
	mx <- mean(x)
	vx <- sum((x - mx)^2)/length(x)
	return(vx)
}
