Module 2: Describing, Clarifying and Presenting Data
4. Summarising data
4.5. Describing the shape of distributions
Describing the shape of a distribution is really only of interest if you are looking at a quantitative variable. Stem-and-leaf plots and histograms can tell you how the values of the variable are distributed. For example, the stem-and-leaf plot of student marks was fairly symmetric with short tails, although there was also an outlier.
"Stem and Leaf" Plot
However, the example of new housing starts in the USA could be described as skewed to the right because the higher values are more spread out than the lower values. There are certain terms you can use to describe the shape of a distribution.
What terminology can you use to describe the shape of a distribution?
1. Is the shape of the distribution symmetric, that is, is the shape similar either side of a central axis (like a mirror image)? Or is the shape skewed, i.e. asymmetric?
2. Does the shape have long tails or short tails of data?
3. Does the shape have only one mode, so that it can be described as unimodal, or does it have more than one, so that it can be described as bimodal (two modes) or multimodal (three or more modes)?
4. Are there any outliers?
You can describe the shape of the distribution of quantitative data using these terms. Describing the shape is a first step in understanding the distribution of a variable. However, you cannot describe the shape of a categorical variable in this way when its categories have no specific order (think of hair colour): the bars could be drawn in any order, so the "shape" of the graph has no meaning.
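To make the unimodal/bimodal/multimodal terminology concrete, here is a minimal sketch that counts the peaks in a list of histogram bar heights. The function names and the two sets of bar heights are hypothetical, invented only for illustration. The rule used (a bar counts as a peak if it is strictly taller than both neighbours) is a simplifying assumption: real histograms are often bumpy, so in practice you would judge the number of modes by eye as well.

```python
def count_peaks(bar_heights):
    """Count bars that are strictly taller than both of their neighbours."""
    peaks = 0
    last = len(bar_heights) - 1
    for i, height in enumerate(bar_heights):
        # Bars at either end are compared with an imaginary bar of height -infinity.
        left = bar_heights[i - 1] if i > 0 else float("-inf")
        right = bar_heights[i + 1] if i < last else float("-inf")
        if height > left and height > right:
            peaks += 1
    return peaks


def describe_modality(bar_heights):
    """Translate the number of peaks into the terminology used above."""
    peaks = count_peaks(bar_heights)
    if peaks == 1:
        return "unimodal"
    if peaks == 2:
        return "bimodal"
    if peaks >= 3:
        return "multimodal"
    return "no clear mode"  # e.g. a flat top where two equal bars tie


# Hypothetical bar heights, not taken from the module's data sets.
one_hump = [2, 5, 9, 5, 2]
two_humps = [2, 8, 3, 7, 1]

print(describe_modality(one_hump))
print(describe_modality(two_humps))
```

Note that this sketch only makes sense for a quantitative variable, where the bars have a natural left-to-right order.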
When we describe the shapes of distributions, a common reference shape is the normal distribution.
The normal distribution (normal curve) is also described as a “bell-shaped” curve (because it has the same cross-sectional shape as a large church bell). A typical normal curve is shown below:
Properties of a normal curve:
- Symmetrical (identical shape on each side of the centre line)
- One mode
- Mean, median and mode are together (at the centre!)
- Most data values are clustered around the centre (in fact, about 68% of the data lie within a band that extends from one standard deviation (SD) unit below the mean to one SD unit above the mean)
A normal curve can be seen to arise from a (practical) situation where there are a large number of histogram bars describing a data set. The normal curve is the envelope which smooths out the shape produced by the tops of the histogram bars.
The marks on the horizontal axis are spaced one SD unit apart. You can see that the interval from 3 SD units below the mean to 3 SD units above the mean contains almost 100% (about 99.7%) of the data. In other words, it is unusual for normally distributed data to be more than 3 SD units away from the mean.
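If you would like to check these percentages yourself, the sketch below computes the proportion of a normal distribution that lies within k SD units of the mean. It relies on a standard result that this module does not derive: for a normal distribution that proportion equals erf(k / √2), where erf is the error function in Python's standard `math` module. The function name `proportion_within` is just an illustrative choice.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k SD units of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {proportion_within(k):.4f}")
```

The three printed proportions are approximately 0.6827, 0.9545 and 0.9973, which agrees with the "68%" and "almost 100%" figures quoted for the normal curve.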
Consider the following practical data
The following data describe word lengths in two magazine articles.
The students who collected this data wanted to compare the nature of articles in the popular magazine New Weekly with articles in New Scientist. Their hypothesis was that the writing style would reflect the type of magazine – i.e., that word length distribution would be an indicator of the complexity of issues discussed in each magazine.
A typical article was chosen at random from each magazine.
The first two graphs show:
(i) the frequency distribution of word lengths
(ii) the relative frequency distribution of word lengths.
| word length | frequency | relative frequency |
| 1 | 10 | 0.040 |
| 2 | 40 | 0.160 |
| 3 | 55 | 0.220 |
| 4 | 50 | 0.200 |
| 5 | 25 | 0.100 |
| 6 | 20 | 0.080 |
| 7 | 15 | 0.060 |
| 8 | 15 | 0.060 |
| 9 | 10 | 0.040 |
| 10 | 5 | 0.020 |
| 11 | 3 | 0.012 |
| 12 | 2 | 0.008 |
| Total | 250 | 1.000 |
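The relative frequency column is simply each frequency divided by the total number of words. As a minimal sketch, the calculation for the New Weekly table can be reproduced as follows (the variable names are illustrative; the numbers are copied from the table above):

```python
# Frequencies of each word length in the New Weekly article (from the table).
new_weekly = {1: 10, 2: 40, 3: 55, 4: 50, 5: 25, 6: 20,
              7: 15, 8: 15, 9: 10, 10: 5, 11: 3, 12: 2}

total_words = sum(new_weekly.values())

# Relative frequency = frequency / total number of words.
relative_frequency = {length: count / total_words
                      for length, count in new_weekly.items()}

print("total words:", total_words)
for length, proportion in relative_frequency.items():
    print(f"{length:>2} letters: {proportion:.3f}")
```

Because every frequency is divided by the same total, the relative frequencies always add up to 1, and the shape of the graph does not change.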
You can see that the two graphs above concerning New Weekly are identical, except for the numbering on the vertical axes.
You can also see that:
- the New Weekly article favours shorter words – longer words are not common
- the distribution (graph) is not symmetrical: it is skewed to the right, because the longer word lengths are more spread out than the shorter ones
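One way to back up these observations with numbers is to compare the mean, median and mode of the New Weekly word lengths. The sketch below uses Python's standard `statistics` module and the frequencies from the table above. The rule of thumb it illustrates (a mean that sits above the median often goes with a longer right-hand tail) is an added assumption rather than something stated in this module, and it does not hold for every data set.

```python
import statistics

# Frequencies of each word length in the New Weekly article (from the table).
new_weekly = {1: 10, 2: 40, 3: 55, 4: 50, 5: 25, 6: 20,
              7: 15, 8: 15, 9: 10, 10: 5, 11: 3, 12: 2}

# Expand the table into one entry per word, e.g. ten 1s, forty 2s, and so on.
word_lengths = [length
                for length, count in new_weekly.items()
                for _ in range(count)]

mean_length = statistics.mean(word_lengths)
median_length = statistics.median(word_lengths)
mode_length = statistics.mode(word_lengths)

print(f"mean = {mean_length:.3f}, median = {median_length}, mode = {mode_length}")
```

For this article the mean (about 4.49) is larger than the median (4), which is larger than the mode (3). Unlike a normal curve, where the three measures are together at the centre, here they are pulled apart in the direction of the longer words.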
Now consider the New Scientist article.
The article from the New Scientist is longer – 425 words compared to 250. Hence, for meaningful comparison, we use relative frequency on the vertical axis.
| word length | frequency | relative frequency |
| 1 | 5 | 0.012 |
| 2 | 12 | 0.028 |
| 3 | 25 | 0.059 |
| 4 | 35 | 0.082 |
| 5 | 55 | 0.129 |
| 6 | 70 | 0.165 |
| 7 | 80 | 0.188 |
| 8 | 53 | 0.125 |
| 9 | 40 | 0.094 |
| 10 | 28 | 0.066 |
| 11 | 12 | 0.028 |
| 12 | 10 | 0.024 |
| Total | 425 | 1.000 |
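Since the two articles differ in length, a fair comparison has to use proportions rather than raw counts. The sketch below, using the frequencies from the two tables, compares the mean word length and the share of "long" words in each article. The helper names and the cut-off of six letters for a "long" word are arbitrary choices made for this illustration; they are not part of the students' original analysis.

```python
# Frequencies of each word length, copied from the two tables.
new_weekly = {1: 10, 2: 40, 3: 55, 4: 50, 5: 25, 6: 20,
              7: 15, 8: 15, 9: 10, 10: 5, 11: 3, 12: 2}
new_scientist = {1: 5, 2: 12, 3: 25, 4: 35, 5: 55, 6: 70,
                 7: 80, 8: 53, 9: 40, 10: 28, 11: 12, 12: 10}


def mean_word_length(frequencies):
    """Mean word length computed from a frequency table."""
    total = sum(frequencies.values())
    return sum(length * count for length, count in frequencies.items()) / total


def share_of_long_words(frequencies, cutoff=6):
    """Relative frequency of words with at least `cutoff` letters."""
    total = sum(frequencies.values())
    long_words = sum(count for length, count in frequencies.items()
                     if length >= cutoff)
    return long_words / total


for name, table in [("New Weekly", new_weekly), ("New Scientist", new_scientist)]:
    print(f"{name}: mean length {mean_word_length(table):.2f}, "
          f"share of words with 6+ letters {share_of_long_words(table):.3f}")
```

On these data the New Weekly article has a mean word length of about 4.49 with 28% of words having six or more letters, while the New Scientist article has a mean of about 6.62 with roughly 69% of words that long. This is consistent with the students' hypothesis, although a single article from each magazine is a very small sample.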
You have now finished Module 2.