Let’s go back to the topic of standard deviation. As mentioned here, the aim is to measure the dispersion of the data points around the mean. Since a data point can lie below or above the mean, we need to make sure that the positive and negative differences do not cancel each other out. Two possible ways to do this are taking the absolute value of each difference or taking its square. Standard deviation is then defined as the square root of the average of the squared differences between each data point and the mean.
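As a minimal sketch of that definition, the snippet below computes the standard deviation step by step. It assumes the population form (dividing by the number of points rather than by one less), and the function name `population_std` and the sample data are made up for illustration only.

```python
import math
import statistics


def population_std(values):
    """Square root of the average squared difference from the mean."""
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / len(values))


# Illustrative data only (not from the post).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_std(data))      # hand-rolled version
print(statistics.pstdev(data))   # standard library, same population form
```

The two printed values should agree, which is a quick way to check that the hand-written steps match the usual definition.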
Why take squares? If we take absolute values, the total absolute difference can be the same for a data set where all points lie at a similar distance from the mean and for a data set where a few points lie much farther from the mean. Squaring addresses this by penalizing the points that are farther from the mean more heavily. This is the purpose of squaring the differences instead of taking their absolute values.
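A small sketch can make this concrete. The two data sets below are invented for illustration: both have the same mean and the same total absolute difference, but one spreads the distance evenly while the other concentrates it in two points. The helper names are hypothetical.

```python
import math


def total_absolute_difference(values):
    """Sum of the absolute differences from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values)


def population_std(values):
    """Square root of the average squared difference from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Illustrative data only: both sets have a mean of 10.
evenly_spread = [8, 12, 8, 12]   # every point is 2 away from the mean
two_far_points = [10, 10, 6, 14]  # two points on the mean, two points 4 away

for name, values in [("evenly spread", evenly_spread),
                     ("two far points", two_far_points)]:
    print(name,
          "| total absolute difference:", total_absolute_difference(values),
          "| standard deviation:", round(population_std(values), 3))
```

The total absolute difference cannot tell the two sets apart, while the standard deviation is larger for the set with the two far points, because squaring weights large differences more heavily.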
In any case, we need to be careful when putting a threshold on the standard deviation, even though it is already based on squared differences.
How careful we need to be depends on how we come up with the threshold. If we only check a few cases, we may be misled, because the same standard deviation can arise from quite different scenarios. If we approve a threshold by looking at a case where the distances to the mean are fairly uniform, we may conclude that a certain standard deviation is acceptable. But the same standard deviation can also occur when the majority of the data points are close to the mean and a few are far from it. The latter case is undesirable, especially if no individual data point should exceed a certain threshold. In that situation, it is better to automate a check that every data point stays within the threshold, rather than relying only on a summary of the dispersion.
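One way such an automated check could look is sketched below. It assumes the per-point threshold is a limit on the distance from the mean, and both the limit of 3 and the data sets are made up for illustration. The function name `points_beyond_limit` is hypothetical.

```python
import math


def population_std(values):
    """Square root of the average squared difference from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def points_beyond_limit(values, limit):
    """Return the points whose distance from the mean exceeds `limit`."""
    mean = sum(values) / len(values)
    return [v for v in values if abs(v - mean) > limit]


# Illustrative data only: both sets have mean 10 and standard deviation 2.
uniform_case = [8, 12, 8, 12, 8, 12, 8, 12]
few_far_case = [10, 10, 10, 10, 10, 10, 6, 14]

LIMIT = 3  # assumed per-point limit on the distance from the mean

for name, values in [("uniform", uniform_case), ("few far", few_far_case)]:
    offenders = points_beyond_limit(values, LIMIT)
    print(name,
          "| standard deviation:", population_std(values),
          "| points beyond limit:", offenders)
```

Both sets would pass a check on the standard deviation alone, but only the per-point check flags the two points that sit 4 away from the mean in the second set.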