Module 2: Describing, Clarifying and Presenting Data
4. Summarising data
4.4. Describing the spread of the distribution of data
The spread of the data tells you about variability.
This is useful information. It enriches the summary information provided by measures of centre. For example, suppose that your mark in a particular subject was 65 and the mean was 60. These values might be interpreted very differently if the marks varied from 56 to 65 (a range of 9) than if they varied from 16 to 92 (a range of 76). The measure of spread given here is called the range: the difference between the largest and smallest values. Although it is the simplest measure of spread, it depends only on the two most extreme values, so it is not very useful for comparing variability between data sets of different sizes. Other measures of spread, such as the interquartile range and the standard deviation, provide more reliable information.
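As a small illustration of the range, the sketch below uses two invented sets of marks. Only the minimum, maximum and mean match the example in the text. The individual marks and the helper name `data_range` are assumptions made up for this sketch.

```python
# Two hypothetical sets of marks (invented for illustration).
# Both have a mean of 60, but very different spreads.
narrow_marks = [56, 58, 59, 60, 62, 65]
wide_marks = [16, 45, 60, 70, 77, 92]


def data_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)


for label, marks in [("narrow", narrow_marks), ("wide", wide_marks)]:
    mean = sum(marks) / len(marks)
    print(label, "mean:", mean, "range:", data_range(marks))
```

A mark of 65 is the top mark in the first set but only a little above the middle of the second, even though the mean is the same in both.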
Using standard deviation
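The mean is the most commonly used measure of centre. When the mean is used, a measure called the standard deviation is generally used to measure the spread of the data. It measures the overall deviation of observations from their mean by computing the average of the squared deviations from the mean and then taking the square root of that value. For a population of N values x_1, ..., x_N with mean \mu, we can write it as:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Since most data comes from samples rather than populations, we use the formula below, where n is the sample size, \bar{x} is the sample mean, and the divisor is n - 1 rather than n:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$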
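The two calculations can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation. The marks are invented for illustration and the function names `population_sd` and `sample_sd` are hypothetical. The population version divides the sum of squared deviations by the number of values, while the usual sample version divides by one less than the number of values. The results are compared with `pstdev` and `stdev` from Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical marks (invented for illustration); their mean is 60.
marks = [56, 58, 59, 60, 62, 65]


def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / len(values))


def sample_sd(values):
    """Same idea, but dividing by n - 1 because the data is a sample."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


print("population SD:", population_sd(marks), statistics.pstdev(marks))
print("sample SD:    ", sample_sd(marks), statistics.stdev(marks))
```

The sample value is always slightly larger than the population value for the same data, and the gap shrinks as the number of observations grows.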
An adequate summary of the distribution of a single variable requires both a measure of centre and a measure of spread. If the mean and the median are both used to summarise 'centre', together they can sometimes indicate the shape of the distribution. If the median and the mean are virtually the same, this suggests a distribution that is approximately symmetric about the centre. If not, the difference between the mean and the median is an indicator of an asymmetric shape.
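The comparison of mean and median can be tried on two small data sets. Both lists below are invented for illustration. The second differs from the first only in its largest value, which pulls the mean upward but leaves the median unchanged.

```python
import statistics

# Hypothetical data sets (invented for illustration).
symmetric = [2, 4, 6, 8, 10]
right_skewed = [2, 4, 6, 8, 100]  # one very large value

for label, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    # A difference near zero suggests an approximately symmetric shape.
    print(label, "mean:", mean, "median:", median, "difference:", mean - median)
```

How large a difference counts as "virtually the same" depends on the scale and spread of the data, so this sketch reports the difference rather than applying a fixed cut-off.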
Let's look at an example.
In 2003 the number of new private housing commencements was measured by the US Census Bureau for each of the 50 states. The histogram of the distribution follows, where the horizontal axis measures the number of housing starts (in thousands) and the vertical (frequency) axis measures the number of states:
(Weiers, 2005, p.85)
Notice that there is a column labelled “more”. It contains the three outliers 143.1, 202.6 and 271.4, which were the states of Texas, Florida and California respectively. These are popular states that had high rates of internal migration and a housing boom.
The median number of starts was 18.3 (i.e., 18,300 homes) and the average was 34.6 (34,600 homes).
The lowest 10% of the data is cut off at 2.6; this value is called the first decile. Another way of saying this is “10% of the area of the distribution lies between 0 and 2.6”.
The lowest 20% is cut off at 5.0 – the second decile.
The lowest 30% is cut off at 8.8 – the third decile.
Each of the bands 0 to 2.6, 2.6 to 5.0 and 5.0 to 8.8 contains 10% of the area under the distribution. In total they contain the (lowest) 30%.
Using a continuous curve to summarise the shape of the histogram gives the following shape for the distribution:
(Weiers, 2005, p.85)
Using deciles, percentiles and quartiles
Note the graphical interpretation of the median (18.3) and the deciles in terms of area. The first decile is the point below which 10% of the distribution lies. The areas above and below the median are equal, and 10% of the area lies below 2.6.
Deciles are a particular case of a more general measure, percentiles. A data distribution is commonly ‘broken up’ into 100 sections. The thirtieth percentile is the data value (boundary) below which 30% of the data lies.
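Deciles and percentiles can be computed with `statistics.quantiles` from Python's standard library (available from Python 3.8). The sketch below uses an invented data set, the whole numbers 1 to 99, not the housing data. Different software packages use slightly different interpolation rules, so cut points may differ a little between tools, especially for small data sets.

```python
import statistics

# Hypothetical data set: the whole numbers 1 to 99 (invented for illustration).
data = list(range(1, 100))

# n=10 gives the 9 cut points between the ten decile bands;
# n=100 gives the 99 cut points between the hundred percentile bands.
deciles = statistics.quantiles(data, n=10)
percentiles = statistics.quantiles(data, n=100)

first_decile = deciles[0]
third_decile = deciles[2]
thirtieth_percentile = percentiles[29]

# Share of the data lying below the thirtieth percentile (close to 30%).
share_below = sum(1 for x in data if x < thirtieth_percentile) / len(data)

print("first decile:", first_decile)
print("third decile:", third_decile)
print("thirtieth percentile:", thirtieth_percentile)
print("share of data below it:", round(share_below, 3))
```

The third decile and the thirtieth percentile are the same boundary, which is why deciles are described as a particular case of percentiles.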
The most commonly used percentiles are quartiles. If the median is used as the measure of centre, then specific percentiles can be used to measure variability and to show the general shape of the distribution.
The first (or lower) quartile is the 25th percentile, the data value below which 25% of the data set lies; it is the median of all the values below the overall median. The second quartile is the 50th percentile, the data value below which 50% of the data set lies (this is, of course, the median). The third (or upper) quartile is the 75th percentile, and it is the median of the values above the overall median. If the lowest and highest individual values are reported along with the three quartiles, then you have what is called the five number summary.
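The "median of each half" description of the quartiles can be sketched as below. The function name `five_number_summary` and the data are hypothetical. One assumption is labelled in the code: when the number of values is odd, the overall median is left out of both halves. This is one common convention and other conventions give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Needs at least two values. Assumption: for an odd number of values
    the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]          # values below the overall median
    upper_half = ordered[half + n % 2:]  # values above the overall median
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )


# Hypothetical data (invented for illustration).
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]
print(five_number_summary(data))
```

These five values are exactly what a box plot displays: the ends of the whiskers, the two edges of the box, and the line inside the box.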
What is a box plot?
The five number summary of a distribution can be represented graphically as a box plot. Individual box plots can be used to represent distributions. For example, if you wanted to draw a box plot of the new housing starts that were described above it would look like the diagram below (with 3 outliers omitted for scaling reasons):
Another example
Box plots can be used to compare the prime-time ratings of two hypothetical television networks. Ratings are a measurement of the television viewing preferences of a sample of the population. The more people who watch a television show, the more popular it is and the higher its ratings. Advertisers usually pay more to have their products advertised on a high-rating show than on a low-rating show.
Test your knowledge
Answer the following questions
- Which station had the most popular shows?
- Which station had the most popular show?
- If you were an advertiser and wanted to reach the largest number of people during prime time, which station would you choose?