The mean of a numeric variable is calculated by adding together the values of all observations in a dataset and then dividing by the number of observations in the set. It is often referred to as the average. Thus:
Mean = sum of all the observations ÷ number of observations
For example, find the mean of these numbers: 5, 3, 4, 5, 7, 6.
Mean = (5 + 3 + 4 + 5 + 7 + 6) ÷ 6
= 30 ÷ 6
= 5
Notice that the value of every member of the dataset is used to calculate the mean.
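The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the variable names are chosen for this example.

```python
# Observations from the worked example above.
observations = [5, 3, 4, 5, 7, 6]

# Mean = sum of all the observations ÷ number of observations.
mean = sum(observations) / len(observations)

print(mean)  # 5.0
```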
The median is the middle value of a dataset once the data have been placed in ascending order. If the dataset has an odd number of values, the median is the single middle value. If it has an even number of values, the median is the mean of the two middle values.
For example, dataset A contains 3, 7, 1, 9, 2, 5, 9. Rearranged in ascending order it becomes: 1, 2, 3, 5, 7, 9, 9. The middle number is 5, so the median is 5.
Dataset B contains 1, 3, 4, 5, 10, 12, 13, and also has a median of 5 although the values of the data vary considerably.
The position of the median can also be found by using the formula (n + 1) ÷ 2, where n is the number of values in a set of ordered data.
For dataset A: n = 7
So the position of the median = (7 + 1) ÷ 2
= 8 ÷ 2
= 4
The median is the fourth number, which has a value of 5.
The above example is for an odd number of observations, i.e. n = 7. However, an extra step is necessary when the number of observations is even.
For example, if n = 8 then
the position of the median = (8 + 1) ÷ 2
= 9 ÷ 2
= 4.5
This means that the position of the median lies between the fourth and fifth observations. To find the value of the median, add together the fourth and fifth observations and divide by two. For example, if the dataset is 1, 1, 4, 4, 8, 9, 9, 10, then the median is (4 + 8) ÷ 2 = 6.
The median is determined by its position in the ordered dataset, not by the size of the values around it. Notice that the values of the other members of the dataset are not taken into consideration, only their position. There are as many values above the median as there are below.
The median is usually calculated for numeric variables but may also be calculated for an ordinal variable, that is, a categorical variable whose categories have a natural order.
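As a minimal sketch, the odd and even cases described above can be written as one Python function. The name `median_of` is hypothetical and chosen for this example. Python lists count from 0, so the fourth observation sits at index 3.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # place the data in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one value")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) ÷ 2, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even n: the mean of the two observations either side of position (n + 1) ÷ 2.
    return (ordered[middle - 1] + ordered[middle]) / 2


dataset_a = [3, 7, 1, 9, 2, 5, 9]
dataset_b = [1, 3, 4, 5, 10, 12, 13]
even_dataset = [1, 1, 4, 4, 8, 9, 9, 10]

print(median_of(dataset_a))     # 5
print(median_of(dataset_b))     # 5
print(median_of(even_dataset))  # 6.0
```

Python's standard library provides `statistics.median`, which handles both cases in the same way.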
The mode is the most frequently observed value in a dataset. It is the only measure you can use when the data are categorical and have no order, for example place of birth, favourite colour or hair colour. Because the values are not numbers, you cannot add and divide them, so you cannot find a mean. Because they cannot be sorted from smallest to largest, you cannot find a middle value, so you cannot find a median either. The mode does not necessarily give an indication of a dataset’s centre. A set of data can have more than one mode (see Figure 1).
For example, a group of friends in Year 10 have the following hair colours: red, brown, blonde, black, blonde, black, brown, brown, black, blonde, brown, brown, black.
| Hair colour | Frequency |
| Red | 1 |
| Brown | 5 |
| Blonde | 3 |
| Black | 4 |
The most common hair colour is brown so the mode is brown.
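The frequency count can be sketched in Python with `collections.Counter` from the standard library. The variable names are chosen for this example. Because a dataset can have more than one mode, the sketch collects every value that reaches the highest frequency.

```python
from collections import Counter

hair_colours = [
    "red", "brown", "blonde", "black", "blonde", "black", "brown",
    "brown", "black", "blonde", "brown", "brown", "black",
]

# Count how often each hair colour occurs.
frequencies = Counter(hair_colours)

# Keep every colour that occurs with the highest frequency.
highest = max(frequencies.values())
modes = [colour for colour, count in frequencies.items() if count == highest]

print(dict(frequencies))  # {'red': 1, 'brown': 5, 'blonde': 3, 'black': 4}
print(modes)              # ['brown']
```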
The range is the actual spread of data including any outliers. It is the difference between the highest and lowest observation.
Range = maximum value – minimum value
For the following dataset of students' ages: 17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15
Range = maximum value – minimum value
= 17 – 12
= 5
The range of the students' ages is 5 years.
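In Python, the range calculation is a one-line subtraction using the built-in `max` and `min` functions. The name `age_range` is chosen for this example to avoid clashing with Python's built-in `range`.

```python
# Students' ages from the example above.
ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]

# Range = maximum value - minimum value.
age_range = max(ages) - min(ages)

print(age_range)  # 5
```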
Quartiles divide data into four equal groups. Using the example of 15 students above, we have the following ordered dataset: 12, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16, 17, 17, 17. We can divide this set into four equal sized groups with each group containing one quarter of the data:
- The first quartile (Q1) is the value that 25% of the data is below.
- The second quartile (Q2) is the value that 50% of the data is below. This is the same as the median.
- The third quartile (Q3) is the value that 75% of the data is below.
In the example:
Q1 = 13
Q2 = 15
Q3 = 16
The interquartile range describes the middle 50% of the data. It is the difference between the upper quartile (Q3, the 75% point) and the lower quartile (Q1, the 25% point), so it is found by subtracting Q1 from Q3. The interquartile range is an indicator of the spread of the data. It reduces the influence of outliers because the highest and lowest quarters are excluded. For the students' ages above, the interquartile range is 16 – 13 = 3 years.
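The quartiles and interquartile range for the students' ages can be sketched with `statistics.quantiles` from Python's standard library (Python 3.8 or later). Several conventions exist for locating quartiles, and they can give slightly different answers for the same data. This sketch assumes the function's default "exclusive" method, which reproduces the values quoted for this dataset.

```python
import statistics

# Students' ages from the example above.
ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(ages, n=4)

# Interquartile range = Q3 - Q1.
iqr = q3 - q1

print(q1, q2, q3)  # 13.0 15.0 16.0
print(iqr)         # 3.0
```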
Five number summary (quartiles)
This is a useful way to summarise data. It consists of:
- the lowest value
- the first quartile (Q1)
- the second quartile (Q2), which is the median
- the third quartile (Q3)
- the highest value.
The range can be found from the difference between the highest and lowest values. The median is the second quartile (Q2) and the interquartile range is the difference between the third and first quartiles (Q3 – Q1).
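As an illustration, the five values can be gathered by a small Python function. The name `five_number_summary` and the dictionary keys are hypothetical, chosen for this example. The quartiles rely on the default method of `statistics.quantiles`, which is one of several quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return the lowest value, Q1, median (Q2), Q3 and highest value."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "lowest": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "highest": max(values),
    }


ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]
summary = five_number_summary(ages)

for name, value in summary.items():
    print(name, value)
```

The range and the interquartile range then follow directly as `summary["highest"] - summary["lowest"]` and `summary["q3"] - summary["q1"]`.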
Standard deviation (s) is the measure of spread most commonly used when the mean is the measure of centre. Standard deviation is most useful for symmetric distributions with no outliers.
The standard deviation for a discrete variable made up of n observations is the positive square root of the variance, as shown in Figure 3.
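The formula in Figure 3 is not reproduced in the text, so the following Python sketch rests on an assumption: it uses the sample form, in which the sum of squared deviations from the mean is divided by n – 1 before taking the square root. The population form divides by n instead. The data are the six numbers from the mean example, and the variable names are chosen for this illustration.

```python
import math
import statistics

observations = [5, 3, 4, 5, 7, 6]
n = len(observations)
mean = sum(observations) / n

# Squared distance of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in observations]

# Assumption: sample variance, dividing by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)
s = math.sqrt(sample_variance)  # positive square root of the variance

# Population form, dividing by n, shown for comparison.
population_sd = math.sqrt(sum(squared_deviations) / n)

print(s)
print(population_sd)
```

The standard library offers the same two calculations as `statistics.stdev` (sample) and `statistics.pstdev` (population).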
Fig 1 Unimodal, bimodal and multimodal
Fig 2 Quartiles
Fig 3 Standard deviation formula