An important aspect of a data distribution is its spread (also called variability or dispersion). Three common measures of spread are the range, the interquartile range, and the standard deviation.
The range of a set of data is a measure of how far apart the highest and lowest values are. To find the range, simply subtract the smallest value from the largest value. For example, suppose we timed a group of students on how long it took them to walk or run one mile, and the fastest time was 5 minutes and the slowest time was 23 minutes. The range of that data set (all the times it took the students to complete the mile) would be 23 – 5 = 18 minutes. If a second group of students were timed on the same task, and the fastest student completed the mile in 6 minutes and the slowest in 15 minutes, the range would be 15 – 6 = 9 minutes. Thus the range, or overall spread, of the first group is greater than that of the second group.
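The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the function name `data_range` and the lists are assumptions, and only the fastest and slowest times (5 and 23, 6 and 15) come from the example. The values in between are made up.

```python
def data_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)


# Hypothetical mile times in minutes. Only the minimum and maximum of each
# list match the example in the text; the middle values are invented.
group_1 = [5, 8, 11, 14, 23]
group_2 = [6, 9, 10, 12, 15]

print(data_range(group_1))  # 18
print(data_range(group_2))  # 9
```

Because the range depends only on the two extreme values, changing any of the middle values in these lists leaves the result unchanged.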
Quartiles are defined as the points in a distribution of data that break up the data into equal quarters, just like the median breaks the data into equal halves. The first quartile is the median of the first half of the data set, the second quartile is the median of the entire data set, and the third quartile is the median of the second half of the data set. The interquartile range is the difference between the third quartile value and the first quartile value. Let’s take a look at two distributions. (These are times for two groups of 21 students to run or walk a mile, in minutes, rounded to the nearest half minute.)
Group 1 Group 2
The median is the middle value of the distribution. With 21 values in each distribution, the middle value is the 11th value (shown in red in each distribution). This means that there are 10 values in each half of each distribution. To find the first quartile, we look at the 5th and 6th values and calculate the average (arithmetic mean) of the two numbers. In the distribution for group 1, these two values are both 6 (shown in green above), so the first quartile is 6 because the average of 6 and 6 is 6. In the distribution for group 2, the two values are 7.5 and 8.5 (shown in blue above). The average of these two values is (7.5 + 8.5) / 2 = 8. Therefore, the first quartile of group 2 is 8.
To find the third quartile, we look at the 5th and 6th values of the bottom half of each distribution. For group 1, these two values are both 11, so the third quartile is 11. For group 2, these two values are both 12, so the third quartile is 12.
Now we can find the interquartile range (IQR) for each distribution. For group 1, the IQR is 11 – 6 = 5. For group 2, the IQR is 12 – 8 = 4. This is an indication that the times for group 2 are closer together than those of group 1.
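The quartile procedure used here (take the median of each half, leaving out the middle value when the count is odd) might be sketched in Python as follows. The function names and the 21 sample times are hypothetical; they are not the data for group 1 or group 2. Other quartile conventions exist and can give slightly different answers on the same data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    With an odd number of values, the middle value is excluded from both
    halves, matching the description in the text.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd; otherwise split evenly.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def iqr(values):
    """Return the interquartile range, Q3 minus Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


# Hypothetical mile times (minutes) for 21 students, for illustration only.
times = [5, 5.5, 6, 6, 6.5, 7.5, 8, 8, 8.5, 9, 9,
         9.5, 10, 10, 10.5, 11, 12, 12.5, 13, 14, 16]

print(quartiles(times))  # (7.0, 9, 11.5)
print(iqr(times))        # 4.5
```

In this sample the 5th and 6th values are 6.5 and 7.5, so the first quartile is their average, 7. The 11th value, 9, is the median, and the 5th and 6th values of the upper half are 11 and 12, giving a third quartile of 11.5.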
Box and Whisker Plot
A box and whisker plot is a good way to show the spread of the data visually. The plot uses the minimum value of the distribution, the three quartiles, and the maximum value. The diagram below shows the box and whisker plots for the two distributions given above. From this diagram, it is easy to see that the data for group 2 is more compact and that the interquartile range (the distance across the box) is smaller for group 2 than it is for group 1.
While the range uses just two values to give a measure of spread and the interquartile range takes into account several more values, standard deviation is a measure of spread that uses all the data. For this reason, standard deviation is often considered a stronger measure of variability than the other two. Roughly speaking, standard deviation measures the typical distance of each value in a distribution from the mean of the distribution. The formula used to calculate the standard deviation, s, is:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$
The formula says to take each value, x, in the data set, subtract the mean, $\bar{x}$, from it, and square the result. Then add up all those squared values and divide the sum by the number of values, N, in the distribution. Finally, take the square root of the result. (The squaring and square root “undo” each other; the adding and dividing by the number of values is like finding the mean.)
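The steps of the formula can be followed one at a time in a short Python sketch. The function name `std_dev` and the sample values are assumptions made for illustration; the function divides by N, as the formula above does.

```python
from math import sqrt


def std_dev(values):
    """Standard deviation that divides by N, following the formula in the text."""
    n = len(values)
    mean = sum(values) / n
    # Subtract the mean from each value and square the result.
    squared_deviations = [(x - mean) ** 2 for x in values]
    # Add the squared deviations, divide by N, then take the square root.
    return sqrt(sum(squared_deviations) / n)


# Hypothetical data chosen so the arithmetic is easy to check by hand:
# the mean is 5, and the squared deviations add up to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(std_dev(sample))  # 2.0
```

Python's standard library offers `statistics.pstdev`, which also divides by N, and `statistics.stdev`, which divides by N – 1 and therefore gives a slightly larger result on the same data.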
Let’s look again at the distributions for the two groups of students given above. For group 1, the mean is 9.2 and the standard deviation is approximately 4.4. For group 2, the mean is 10.15 and the standard deviation is about 2.6. These standard deviations are another indication that there is greater spread (or variability) in the values in the first group than in the second group. This can be seen in the normal plot of each distribution, which has at its center the mean of the distribution. The smaller standard deviation of group 2 is seen in the narrowness of the curve (shown in blue) below.