Module 2: Describing, Clarifying and Presenting Data
4. Summarising data
4.1. Measuring the centre of a distribution
There are several measures of centre, but here we consider three: the median, the mean and the mode.
What is the median?
The median is the middle value of a ranked data set - so that half of the data falls above it and half below it.
This is easy if there is an odd number of data values, but when there is an even number you need to find the two central values, add them together and then divide the sum by two to obtain the median. Once again, let’s look at the student grades from the beginning of this module.
52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77, 78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51, 62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76
To restate the process: in this example, the total number of marks is 33. The middle value of the ranked set is the 17th mark, and so the median mark is 62. Note that 16 marks are ranked below the 17th mark and 16 are ranked above it (16 + 1 + 16 = 33).
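As a minimal sketch of this ranking step, the marks listed above can be sorted and the middle mark picked out by position. The variable names are illustrative only; for an odd count n, the middle position is (n + 1) / 2.

```python
# Student marks as listed in this module (33 values).
marks = [52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77,
         78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51,
         62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76]

ranked = sorted(marks)            # rank from lowest to highest
n = len(ranked)                   # number of marks
middle_position = (n + 1) // 2    # 1-based position of the middle mark
median_mark = ranked[middle_position - 1]  # Python lists are 0-based

print(ranked)
print(n, middle_position, median_mark)  # 33 17 62
```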
Consider a smaller data set of 8 values:
Rank the data (smallest to largest):
When there is an even number of data values, the data set splits evenly and the median is usually not a member of the data set.
In this case, the median will be at position 4½, halfway between the data in the 4th position ($30) and the data in the 5th position ($31). Therefore, the value (size) of the median is ($30 + $31) ÷ 2 = $30.50.
Note that there are 4 values lower than $30.50 and 4 values higher than $30.50.
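In summary, to find the median:
1. Identify the size of the data set (n).
2. Rank the values of the data set (usually lowest to highest).
3. Locate the position of the median – it is found at position (n + 1) ÷ 2.
4. Last, identify the size (value) of the data value at that position and quote it as the median. If the position falls halfway between two data values, the median is halfway between those two values.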
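The four steps can be sketched as a small function. This is an illustration rather than a definitive implementation: the function name is an assumption, and the eight-value list is hypothetical, chosen only so that its 4th and 5th ranked values are 30 and 31 as in the example above. It is not the original data.

```python
def find_median(values):
    """Median located by position (n + 1) / 2 in the ranked data."""
    if not values:
        raise ValueError("the data set is empty")
    n = len(values)                    # step 1: size of the data set
    ranked = sorted(values)            # step 2: rank lowest to highest
    position = (n + 1) / 2             # step 3: 1-based median position
    lower = ranked[int(position) - 1]  # value at, or just below, the position
    if position == int(position):      # odd n: position is a whole number
        return lower                   # step 4: quote that value
    upper = ranked[int(position)]      # even n: the next value up
    return (lower + upper) / 2         # halfway between the two central values


marks = [52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77,
         78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51,
         62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76]

# Hypothetical prices in dollars (NOT the original eight values).
hypothetical_prices = [27, 28, 29, 30, 31, 33, 35, 40]

print(find_median(marks))                # 62
print(find_median(hypothetical_prices))  # 30.5
```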
Return to the student marks data set.
The structure that has been drawn here is a table, but it also works as a graph. In this type of frequency histogram (frequency is really just another name for counts), data have been collected into cells. This allows you to get an idea of the shape of the distribution of the data. A stem-and-leaf plot is a shorthand way of doing the same thing without sacrificing information. With a stem-and-leaf plot you must always include a key that states the size of the data. In the example above, the stems are tens, as shown in the key, and the leaves are units (values of one). This means the size of the ‘5’ in the stem is actually ‘50’, and that stem includes all marks from 50 to 59 inclusive.
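A stem-and-leaf plot of the student marks can be rebuilt from the raw data. The sketch below is one possible way to do so and makes no claim to match the layout of the original figure; the function name and the text format are assumptions.

```python
from collections import defaultdict

marks = [52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77,
         78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51,
         62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76]


def stem_and_leaf(values):
    """Group values by tens (stem) and collect the units digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 68 -> stem 6, leaf 8
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(marks)
for stem in range(min(plot), max(plot) + 1):
    # Stems with no marks (here, the 20s) are printed with no leaves.
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 5 | 2 means a mark of 52")
```

Because every leaf is kept, the original marks can be read back from the plot, which is what "without sacrificing information" means in practice.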
Creating a Histogram from a Stem-and-Leaf Plot
Test your knowledge
How many marks are in the 60s?
- 68 marks
- 8 marks
- 9 marks
- 2 marks
Did you answer '9 marks'? If you did, then you are demonstrating the ability to read a stem-and-leaf plot. Stem-and-leaf plots also provide a picture of the spread and shape of the distribution of a particular variable within a sample. More about that later.
If you were to convert each leaf beside a stem (also called a class) into a rectangle, it would look something like this:
If you then rotated the histogram and removed the horizontal lines separating the rectangles in each class, you would end up with a classical graphical display of a histogram.
The height of the rectangles above each class is proportional to the number of data values that fall into that class.
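To make the link between class counts and rectangle heights concrete, the following sketch counts the marks in each class of ten and prints a simple text histogram. The class boundaries (10-19, 20-29, and so on) follow the stems described earlier; the bar character and layout are arbitrary choices.

```python
from collections import Counter

marks = [52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77,
         78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51,
         62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76]

# Label each class by its lower bound: a mark of 68 falls in class 60.
class_counts = Counter((mark // 10) * 10 for mark in marks)

for lower in range(10, 100, 10):
    count = class_counts[lower]  # a Counter returns 0 for an empty class
    # One '#' per mark, so bar length is proportional to the class count.
    print(f"{lower:2d}-{lower + 9:2d} | {'#' * count} ({count})")
```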
ii. Finding the mean
The mean can be described as the arithmetic average. Statisticians use symbols and equations to show how the mean can be calculated.
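In conventional notation, writing the n measurements as x_1, x_2, ..., x_n and the mean as x-bar (these symbols are the standard ones and are assumed here), the calculation can be expressed as:

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
```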
Don’t be put off by this equation. Remember, to calculate the mean is to calculate the mathematical average: you add together all the measurements and then divide that total by the number of measurements. For this set of student marks the total number of measurements is 33. The sum of these 33 measurement values is 2042. The mean is calculated by dividing 2042 by 33 and is 61.88. Rounding gives a mean of approximately 62. It is often useful to round statistics, especially summary statistics such as the mean, for presentation purposes.
If your mark for the subject was 76, are you above or below the mean for the class?
You are above the mean of 62.
NOTE: In this case the mean of 61.88 is slightly smaller than the median of 62. This is because the mean is affected by the numerical value of every measurement, so a very low score like 16 drags the mean downwards. Likewise, a very large data value will drag the mean upwards. The median is affected only by the relative position of measurements, and so 16 has the same effect on the median as any other number below 62. The median is not affected by the size of extreme data values; it is affected by the number of data values in the data set.
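The calculation and the point made in the note can be checked with a short sketch that sums the marks exactly as listed earlier. The adjusted data set, in which the lowest mark of 16 is replaced by 46, is hypothetical and serves only to show how one value moves the mean but not the median.

```python
from statistics import median

marks = [52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77,
         78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51,
         62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76]

total = sum(marks)   # add together all the measurements
n = len(marks)       # number of measurements
print(total, n, round(total / n, 2))

# Hypothetical change: raise the lowest mark from 16 to 46.
adjusted = [46 if mark == 16 else mark for mark in marks]

print(round(sum(adjusted) / len(adjusted), 2))  # the mean moves upwards
print(median(marks), median(adjusted))          # 62 62: the median does not
```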
What is the mode?
The mode is the most common value in a data list. It is the value with the highest frequency. In the example of student marks, the mode is 68 because it occurs three times (i.e. three students obtained 68). The mode can be useful with categorical or discrete variables. For example, if you managed a shoe shop you might find the mode a useful concept because it could tell you which men's and women's shoe sizes are the most common among your customers.
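As a brief sketch, the mode of the student marks can be found by counting how often each mark occurs. The code returns a list because a data set can have more than one value tied for the highest frequency; the variable names are illustrative.

```python
from collections import Counter

marks = [52, 64, 16, 48, 35, 52, 85, 96, 90, 87, 77,
         78, 37, 68, 62, 60, 51, 55, 57, 64, 54, 51,
         62, 43, 68, 71, 76, 68, 65, 83, 47, 44, 76]

counts = Counter(marks)          # frequency of each distinct mark
highest = max(counts.values())   # the highest frequency
modes = sorted(mark for mark, count in counts.items() if count == highest)

print(modes, highest)  # [68] 3
```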