Written by Steve
More than half of employers in the UK, and likely elsewhere, are looking to recruit people with data analysis and communication skills, so it is more important than ever to understand statistics and what they can do. Demand is now so widespread that many new courses, including those in the sciences and humanities, include data analysis techniques. The new A Level Maths courses are no different: around 15–20% of each course focuses on using and interpreting a large data set. “Big Data” may sound like a scary concept, but it’s not. In the “3 V’s” definition offered by Doug Laney in 2001, Big Data refers to data volume, data velocity and data variety. In A Level Maths we look at how to deal with the volume aspect, using the functions on your calculator and software such as Excel and GeoGebra. To address data problems in your A Levels, all you need is these tools and a few simple steps, and you will be able to tackle any exam question on your data set.
Firstly, make sure you know what the data is. Find a copy of the data relevant to your exam, save it to your computer and study it until you know it well. (You can download copies from the exam board websites: Edexcel; AQA; OCR.) Each of the exam boards has said that “students who have studied the data set in advance will have an advantage”, so make sure that you are one of these students.
Think about what you already know about this topic. It is important that you don’t forget what you know about the subject from your everyday life and general knowledge. For example, the OCR data set relates to travel. Remember that children cannot drive and are unlikely to be travelling alone. You also know that bus and train services are less reliable in some areas of the country than in others. This can be useful when interpreting the data and communicating any necessary assumptions.
Decide which statistical calculations you want to perform. It is important that you don’t just calculate every statistic you know without some sense of purpose. You don’t have the time, and it does not show that you understand the techniques. For example, if you want to investigate a relationship between two variables, it is of little use to find their standard deviations separately. You would be better off drawing a scatterplot or calculating a correlation coefficient.
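To make the correlation idea concrete, here is a minimal sketch in Python of how a product moment correlation coefficient is calculated for paired data. Python is used here purely for illustration (the same result is available from your calculator, Excel or GeoGebra), and the function name pearson_r and the sunshine and temperature values are made up for this example; they are not taken from any exam board data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative values, NOT taken from any exam board data set.
daily_sunshine_hours = [2.0, 4.5, 6.0, 7.5, 9.0]
daily_max_temp_c = [14.0, 16.5, 18.0, 19.0, 22.0]

r = pearson_r(daily_sunshine_hours, daily_max_temp_c)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 suggests a strong linear relationship and a value close to 0 suggests little linear relationship, but it is still up to you to explain what that means in the context of the data.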
At this point it is worth remembering that you are working with a large data set, so doing these calculations by hand is going to prove very time-consuming and possibly even unreliable. Make sure you know how to use your computer software to help you. You should know how to use software packages such as Excel and GeoGebra to calculate averages, quartiles and standard deviations. You should know how to plot a basic scatterplot and find a line of best fit. You should be able to get your computer to plot histograms and box plots. This will allow you to focus on the meaning of the data and what the techniques show. Of course, you still need to know how to do the basic calculations and diagrams yourself on a small scale, so that you can understand what is happening, but for the large data set you need to use these resources.
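If you want to check what your software is doing, the same summary statistics can be sketched in a few lines of Python using its built-in statistics module. This is only an illustration and not a replacement for Excel or GeoGebra. The rainfall values and the function name summarise are made up for this example. Note also that quartile conventions differ between packages, so the quartiles here may not match your calculator exactly, and that pstdev divides by n whereas stdev divides by n - 1, so check which version your course expects.

```python
import statistics

# Made-up daily rainfall values in mm, for illustration only.
rainfall_mm = [0.0, 1.2, 3.4, 0.0, 7.8, 2.1, 0.4, 5.6, 0.0, 12.3, 4.4, 1.0]


def summarise(values):
    """Return common summary statistics for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "lower quartile": q1,
        "upper quartile": q3,
        "interquartile range": q3 - q1,
        # Population standard deviation (divides by n).
        "standard deviation": statistics.pstdev(values),
    }


for name, value in summarise(rainfall_mm).items():
    print(f"{name}: {value:.2f}")
```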
You may want to try taking samples from your data set and “cleaning”* them, as this is most likely what you will be asked to work with in the final exams. Remember that the examiners cannot ask you to do lengthy calculations with the entire data set; the exam is simply not long enough! Instead they will ask you to perform calculations on smaller samples and then ask you to comment on the results using your knowledge and understanding of the complete data set. For example, AQA may ask you to find the mean amount of milk sold per year between 2001 and 2010 in the North of the UK and then ask how this relates to the sales of milk in the rest of the country. You will not be expected to calculate the milk sales for each region in the exam, but you will be expected to know that overall the sales in the North are lower than those in the South, and to make a statement about why this is so.
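As a rough sketch of what cleaning and sampling can look like, the Python below drops missing readings and values outside a plausible range, then takes a simple random sample from what is left. Everything here is an assumption made for illustration: the temperature readings are made up, the function names clean and simple_random_sample are hypothetical, and the plausible range of -20 to 45 degrees is a judgement call that you would need to justify for your own data set.

```python
import random

# Made-up records; None stands for a missing reading and 99.9 is an obvious anomaly.
daily_max_temp_c = [18.2, 19.1, None, 17.5, 21.0, 99.9,
                    16.8, 20.3, None, 18.9, 19.4, 17.1]


def clean(values, lowest, highest):
    """Drop missing readings and values outside a plausible range."""
    return [v for v in values if v is not None and lowest <= v <= highest]


def simple_random_sample(values, size, seed=None):
    """Pick `size` distinct items at random; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(values, size)


cleaned = clean(daily_max_temp_c, lowest=-20.0, highest=45.0)
sample = simple_random_sample(cleaned, size=5, seed=1)
print(len(cleaned), "readings kept; sample:", sample)
```

Whenever you remove a value, be ready to say why: an exam question may ask you to justify treating a reading as an anomaly rather than as a genuine extreme.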
Once you have completed your calculations and familiarised yourself with the various types of diagrams, you need to interpret your findings. This means that you need to understand what each of the calculations does and how it relates to your data, and then explain your understanding. It is important that you refer to the context of the data so that you can show you have really understood what you have been doing and that your evidence supports your findings. It is this interpretation that will make the difference between students who achieve the higher grades and those who don’t. For example, the Edexcel data is about the weather. So you might say that the standard deviation of the monthly rainfall is greater by 1.5 in Lerwick than in London, which means that the rainfall in Lerwick is much less consistent. A higher-achieving student should then go on to suggest why this may be the case using real-world knowledge. For example, Lerwick is in the Shetland Islands and exposed to the North Sea, whereas London is inland and so protected from much of the inclement weather suffered there.
So now you have an idea of what using the large data set entails. Basically you need to find it, play with it and practise using it to support various hypotheses. Examiners are looking for you to show that you understand statistical techniques and can interpret findings in context. Enjoy getting to know your data set!
*Cleaning a data set means removing anomalies and checking the data for accuracy.
Be sure to give us a shout if you have any questions!