Thursday, September 06, 2007

Excel

I was quite surprised by the following information so I feel the need to make a public service announcement about the stdev function in excel. I normally try to resist posting about math, but I think that lots of people use the stdev function.

If you ever use the stdev function in excel it may not be doing what you expect.

Typically the formula for the standard deviation of a set of points is
sqrt(sum((x-m)^2)/n)
where x is each point, m is the mean, and n is the number of points. This formula applies when you have all the points from a population. However, if you only have a sampling of points from a population then the correct formula is actually
sqrt(sum((x-m)^2)/(n-1))
I'll explain why below.

In excel the stdev function actually uses the n-1 formula. If you want the formula that divides by n the correct function is stdevp. Clearly picking the correct one only matters if you have a small number of points, since as n gets big the difference between the two becomes small. But if you are looking for the standard deviation of a few points it can make a big difference.

Standard deviation is about the distance of points from the mean. If you have the full population then you know the mean of the population and you know the distance from every point to the mean. So you have n distances that you are using. However, if you are only taking samples from a population then you do not know the mean of the population. So you create a mean from the points you have. But if you have two points then you really only have one distance, and you cut it in half by using the mean of your samples. So in this case you really have n-1 distances (by not having the population mean you have one less degree of freedom).

A quick check: If the entire population is one point then the mean equals that point and you divide by 1 so the standard deviation is zero. If you take one sample point from a larger population then the sample mean equals the one point so the numerator is zero and n-1 = 0 so it's 0/0 which is indeterminate which makes sense since if you only take a single sample you cannot estimate the standard deviation of the population.

I had never heard of the n-1 case before today and pieced together this explanation so don't trust it too much, but do be careful if using stdev with a small number of points.

No comments: