Australian Bureau of Statistics

Education Services

Teacher Statistical Literacy

Concepts and definitions

    Sampling

    A census is a collection of information from all units in the population. The Census of Population and Housing is a statistical collection that aims to accurately measure the number of persons in Australia on Census night, their key characteristics and the dwellings in which they live.

    An estimate is an inference for the target population using information obtained from a sample of the population.

    A sample is a part of a population selected for the purpose of studying certain characteristics of the entire population of interest. A sample is used to represent the population. You can often get a response from a sample where it would not be possible to get a response from every member of the population.

    Sample size
    The sample size is the number of units (such as persons, households, businesses or schools) being surveyed. In general, the larger the sample size, the smaller the sampling error.

    Random sample
    In a random sample, all units in the target population have an equal chance of selection.

    Simple random sample
    All members of the sample are chosen at random and have the same chance of being in the sample.
    A Tattslotto draw is a good example of simple random sampling. A sample of six numbers is randomly generated from a population of 45 with each number having an equal chance of being selected.
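
The draw described above can be sketched in Python. This is only an illustration: the fixed seed is an assumption made so the output is repeatable, and a real draw would not use one.

```python
import random

# Population: the 45 numbered balls described in the text.
population = list(range(1, 46))

# Hypothetical fixed seed, used only so the sketch gives the same result each run.
rng = random.Random(2008)

# random.sample draws without replacement, so no number can be picked twice
# and every number has the same chance of being in the sample.
draw = rng.sample(population, k=6)
print(sorted(draw))
```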

    Systematic random sample
    The first member of the sample is chosen at random, then the other members of the sample are taken at regular intervals.
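
One possible way to sketch this in Python is shown below. The function name `systematic_sample`, the list of 24 student labels and the seed are all hypothetical, and the sketch assumes the population is at least as large as the sample.

```python
import random

def systematic_sample(units, sample_size, rng):
    """Pick a random starting unit, then take every k-th unit after it."""
    interval = len(units) // sample_size   # the sampling interval k
    start = rng.randrange(interval)        # random start within the first interval
    return [units[start + i * interval] for i in range(sample_size)]

# Hypothetical list of 24 students, in the order they appear on a class roll.
students = [f"student_{i:02d}" for i in range(1, 25)]

sample = systematic_sample(students, 6, random.Random(1))
print(sample)
```

With 24 students and a sample of 6, the interval is 4, so the sample is one of the first four students followed by every fourth student after that.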

    Stratified random sample
    Relevant subgroups (strata) within the population are identified, and a random sample is selected from within each stratum.
    For example, a school has 24 Year 7 students. Eight of the students are 11 years old, twelve are 12 years old and four are 13 years old. The strata are the ages of the students.
    To take a stratified sample, select one quarter of the students in each age group: for instance, two students from the 11-year-olds, three students from the 12-year-olds and one student who is 13 years old.
    In this example, the strata are proportionally represented; however, this will not always be the case. The important thing to remember is to take a random sample from each stratum.
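
The Year 7 example can be sketched as follows. The student labels and the seed are made up for illustration; only the stratum sizes (8, 12 and 4) and the one-quarter sampling fraction come from the example.

```python
import random

# Hypothetical roll of the 24 Year 7 students, grouped by age (the strata).
strata = {
    11: [f"age11_{i}" for i in range(1, 9)],    # 8 students
    12: [f"age12_{i}" for i in range(1, 13)],   # 12 students
    13: [f"age13_{i}" for i in range(1, 5)],    # 4 students
}

rng = random.Random(7)   # hypothetical seed, for repeatability only
fraction = 0.25          # one quarter of each stratum

sample = {}
for age, members in strata.items():
    n = round(len(members) * fraction)     # 2, 3 and 1 for these strata
    sample[age] = rng.sample(members, n)   # random selection within the stratum

print(sample)
```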

    Non-random sample
    In a non-random sample, the chance of a member of the population being in the sample is unknown. The accuracy of the sample in representing the population is unknown.

    Quota sample
    This is a type of stratified sampling in which selection within the strata is non-random. Quota sampling requires setting a number of participants to include in a survey – usually a proportion of the population.
    Take the example of Year 7 students from the stratified random sample above who are in strata of age groups. Unlike stratified random sampling where participants are selected at random, participants in a quota sample are selected to fill the quotas.
    For instance, the first 15 twelve-year-old Year 7 students to arrive at school on any given day may be selected. However, this sample may not be representative of all twelve-year-olds in Year 7.

    Convenience sample
    In a convenience sample, participants are selected by how easy it is to reach them.
    For example, the first ten students to walk through the front gates of the school is an easy sample to take. Convenience sampling does not produce a representative sample of the population because people or things that can be reached easily and conveniently are likely to be different to those that are harder to reach.

    Volunteer sample
    This is where participants volunteer to be part of the survey.
    Phone-in sampling is a common method of volunteer sampling used by television and radio stations to measure public opinion. People are asked to telephone or SMS their vote on a particular issue by a certain time. There is no control over how many people vote.
    There are two main problems with this type of sampling. Firstly, there is no limit to the number of times a person can vote, and secondly, those not interested in voting will not be included in the sample. People who don't call in may have different views to the people who are calling in.
    Additionally, only those watching television or listening to the radio know that there is a survey taking place.
    As such, volunteer sampling is unlikely to produce a sample that accurately represents the population.

    Sampling error
    Sampling error is the difference between an estimate derived from a sample survey and the true value that would result if a census of the whole population was taken.
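
As a rough illustration, the sketch below treats the 15 students' ages used in the range example further down this page as if they were a whole population, and compares a sample mean with the population mean. The function name `sampling_error` and the seed are assumptions made for the example.

```python
import random
from statistics import mean

# Stand-in population: the 15 students' ages from the range example.
population = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]
true_mean = mean(population)   # the value a census would give

def sampling_error(sample_size, rng):
    """Estimate from a random sample minus the true population value."""
    sample = rng.sample(population, sample_size)
    return mean(sample) - true_mean

rng = random.Random(42)   # hypothetical seed
print(sampling_error(5, rng))                 # error from a sample of 5
print(sampling_error(len(population), rng))   # a census: the error is zero
```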

    Non-sampling error
    Non-sampling errors are not caused by sampling methodology. They can be made by participants and interviewers when the questionnaire is being filled in, or they can happen when the questionnaire is being processed.

    Frequency and distribution

    The frequency (f) of a particular observation is the number of times the observation occurs in that data.

    Cumulative frequency
    Cumulative frequency is the total of a frequency and all frequencies below it in a frequency distribution. It is the running total of frequencies.

    Relative frequency
    Relative frequency is another term for proportion. It is the number of times a particular observation occurs divided by the total number of observations.
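
The three kinds of frequency can be worked out together. The sketch below uses a small made-up dataset of six observations; the variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

observations = [5, 3, 4, 5, 7, 6]   # hypothetical dataset

freq = Counter(observations)                  # frequency of each distinct value
values = sorted(freq)                         # [3, 4, 5, 6, 7]
frequencies = [freq[v] for v in values]       # [1, 1, 2, 1, 1]
cumulative = list(accumulate(frequencies))    # running total: [1, 2, 4, 5, 6]
relative = [f / len(observations) for f in frequencies]   # proportions

for v, f, c, r in zip(values, frequencies, cumulative, relative):
    print(v, f, c, round(r, 3))
```

The last cumulative frequency equals the number of observations, and the relative frequencies add up to 1.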

    The distribution of a variable is the pattern of values of the observations.

    Summary statistics

    The mean of a numeric variable is calculated by adding together the values of all observations in a dataset and then dividing by the number of observations in the set. It is often referred to as the average. Thus:
    Mean = (sum of all the observations) / (number of observations)

    For example, find the mean of these numbers 5, 3, 4, 5, 7, 6.

    Mean = (5 + 3 + 4 + 5 + 7 + 6) / 6
    = 30 / 6
    = 5

    Notice that the value of every member of the dataset is used to calculate the mean.
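
The same calculation, written as a short Python sketch:

```python
observations = [5, 3, 4, 5, 7, 6]

total = sum(observations)           # 30: every observation contributes
count = len(observations)           # 6
mean_value = total / count          # 30 / 6

print(mean_value)
```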

    The median is the middle value of a dataset with an odd number of observations, or the mean of the middle two values of a dataset with an even number of observations, after the data have been placed in ascending order.

    For example, dataset A contains 3, 7, 1, 9, 2, 5, 9. Rearranged in ascending order it becomes: 1, 2, 3, 5, 7, 9, 9. The middle number is 5, so the median is 5.
    Dataset B contains 1, 3, 4, 5, 10, 12, 13, and also has a median of 5, although its other values differ considerably from those of dataset A.

    The position of the median can also be found by using the formula (n + 1) / 2, where n is the number of values in a set of ordered data.

    For dataset A: n = 7

    So the position of the median = (7 + 1) / 2
    = 8 / 2
    = 4

    The median is the fourth number which has a value of 5.

    The above example is for an odd number of observations, i.e. n = 7. However, an extra step is necessary when the number of observations is even.
    For example, if n = 8 then
    the position of the median = (8 + 1) / 2
    = 9 / 2
    = 4.5
    This means that the position of the median lies between the fourth and fifth observations. To find the value of the median, add together the fourth and fifth observations and divide by two. For example, if the dataset is: 1, 1, 4, 4, 8, 9, 9, 10 then the median is (4 + 8) / 2 = 6.
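
The position rule for both the odd and even cases can be sketched in Python. The function below is an illustrative helper written for this example, not a library routine.

```python
def median(values):
    """Median found from the position (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position of the median
    if position.is_integer():            # odd n: a single middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]   # e.g. the 4th value when position is 4.5
    upper = ordered[int(position)]       # e.g. the 5th value
    return (lower + upper) / 2           # even n: mean of the middle two

print(median([3, 7, 1, 9, 2, 5, 9]))         # dataset A
print(median([1, 1, 4, 4, 8, 9, 9, 10]))     # the even-sized example
```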

    The median value is decided by its location in the ordered dataset and not because of its actual value. Notice that the values of the other members of the dataset are not taken into consideration, only their position. There are as many values above the median as there are below.

    The median is usually calculated for numeric variables but may also be calculated for an ordinal variable.

    The mode is the most frequently observed value in a dataset. Mode is the only measure you can use when the data is categorical and has no order – for example, place of birth, favourite colour and hair colour. Because the data are not numbers, you cannot add and divide, so you cannot find a mean. Because the data cannot be sorted from smallest to largest, you cannot find a middle value, so there is no median. The mode does not necessarily give an indication of a dataset’s centre. A set of data can have more than one mode (see Figure 1).

    For example, a group of friends in Year 10 have the following hair colours: red, brown, blonde, black, blonde, black, brown, brown, black, blonde, brown, brown, black.


    Hair colour    Frequency
    Red            1
    Brown          5
    Black          4
    Blonde         3

    The most common hair colour is brown so the mode is brown.
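
The tally can be checked with a short Python sketch. `statistics.multimode` returns every value that shares the highest frequency, which is useful because a dataset can have more than one mode.

```python
from collections import Counter
from statistics import multimode

hair = ["red", "brown", "blonde", "black", "blonde", "black", "brown",
        "brown", "black", "blonde", "brown", "brown", "black"]

counts = Counter(hair)     # frequency of each hair colour
modes = multimode(hair)    # list of the most frequent value(s)

print(counts)
print(modes)
```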

    The range is the actual spread of data including any outliers. It is the difference between the highest and lowest observation.
    Range = maximum value – minimum value
    For the following dataset of students' ages: 17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15

    Maximum value = 17
    Minimum value = 12

    Range = maximum value – minimum value
    = 17 – 12
    = 5

    The range of the students' ages is 5 years.

    Quartiles divide data into four equal groups. Using the example of 15 students above, we have the following ordered dataset: 12, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16, 17, 17, 17. We can divide this set into four equal-sized groups, with each group containing one quarter of the data:

    • The first quartile (Q1) is the value that 25% of the data is below.
    • The second quartile (Q2) is the value that 50% of the data is below. This is the same as the median.
    • The third quartile (Q3) is the value that 75% of the data is below.
    In the example:
    Q1 = 13
    Q2 = 15
    Q3 = 16
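
One way to arrive at these values is to take the median of the whole dataset, then the median of the values on each side of it. The sketch below assumes that convention; other quartile conventions exist and can give slightly different answers for some datasets. The function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 as the medians of the lower half, the whole set and the upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                    # values before the middle
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # values after the middle
    return median(lower), median(ordered), median(upper)

ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)
```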

    Interquartile range

    The interquartile range refers to the middle 50% of data. Another way to put it is that the interquartile range is the difference between the upper (75%) and lower (25%) quartiles. The interquartile range is an indicator of the spread of the data. It eliminates the influence of outliers since the highest and lowest quarters are removed. The interquartile range is found by subtracting Q1 from Q3.

    Five number summary (quartiles)
    This is a useful way to summarise data. It consists of:

    • the lowest value
    • the first quartile (Q1)
    • the second quartile (Q2)
    • the third quartile (Q3)
    • the highest value.
    The range can be found from the difference between the highest and lowest value. The median is the second quartile (Q2) and the interquartile range is the difference between the third and first quartiles (Q3 – Q1).
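
For the 15 students' ages, the summary can be sketched with Python's `statistics.quantiles`. The choice of `method="exclusive"` is an assumption: for this dataset it gives the same quartiles as the example above, but it is not necessarily the convention used for every dataset.

```python
from statistics import quantiles

ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(ages, n=4, method="exclusive")

summary = {
    "lowest": min(ages),
    "Q1": q1,
    "Q2 (median)": q2,
    "Q3": q3,
    "highest": max(ages),
}

data_range = summary["highest"] - summary["lowest"]   # 17 - 12
iqr = q3 - q1                                         # Q3 - Q1

print(summary, data_range, iqr)
```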

    Standard deviation
    Standard deviation (s) is the measure of spread most commonly used when the mean is the measure of centre. Standard deviation is most useful for symmetric distributions with no outliers.
    The standard deviation for a discrete variable made up of n observations is the positive square root of the variance, as shown in Figure 3.
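
As a rough sketch, the code below computes the standard deviation of the dataset 5, 3, 4, 5, 7, 6 in two common ways. It assumes the usual textbook definitions: the text does not state whether the variance divides by n or by n - 1, so both versions are shown, and neither should be read as the formula in Figure 3.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [5, 3, 4, 5, 7, 6]
m = mean(data)                                      # 5

# Squared distance of each observation from the mean; these sum to 10.
squared_deviations = [(x - m) ** 2 for x in data]

# Variance with divisor n (population form), then its positive square root.
population_sd = sqrt(sum(squared_deviations) / len(data))

# Variance with divisor n - 1 (sample form), then its positive square root.
sample_sd = sqrt(sum(squared_deviations) / (len(data) - 1))

print(round(population_sd, 3), round(sample_sd, 3))
```

The standard library functions `pstdev` and `stdev` give the same two results and can be used as a check.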

    Fig 1 Unimodal, bimodal and multimodal distributions

    Fig 2 Quartiles

    Fig 3 Standard deviation formula

    Commonwealth of Australia 2008

    Unless otherwise noted, content on this website is licensed under a Creative Commons Attribution 2.5 Australia Licence together with any terms, conditions and exclusions as set out in the website Copyright notice. For permission to do anything beyond the scope of this licence and copyright terms contact us.