Test results “down by one percentage point”

I’m not going to write a huge amount about yesterday’s national results for key stage 2 English, maths and science (well, actually, it seems I have), but one thought does leap out.

A great deal is being made of a one percentage point drop in the results for English. The proportion of pupils reaching the “expected” level 4 or above edged down from 81 to 80 per cent, the first time the headline English results have fallen since the tests were introduced in 1995. It prompted, I am told, a front-page story in London’s Evening Standard and a mention near the top of most other articles.

Michael Gove, the Shadow Secretary of State for Children, Schools and Families, was quoted as saying: “We have seen a historic drop in English results, the brightest students are not being stretched, and the weakest are being failed the most. It is deeply worrying that English results are in decline.”

But it seems to me foolish to draw any conclusion from a one percentage point fall in national test results from one year to the next. The data are simply not trustworthy enough to allow one to report confidently that this represents a fall in standards. This is not necessarily the fault of the process by which pass marks are set, which is probably as scientific as it can be. It simply reflects a reality that probably applies to any testing system.

An inquiry into the setting of the pass marks – or “level thresholds”, in the jargon – for national tests, carried out for the Government by Sir Jim Rose back in 1999, contained the following interesting insight. It said: “An enormous amount of technical and statistical expertise is brought to bear on designing the tests and making them consistent with the national curriculum standards expected of pupils at the end of each key stage, year-on-year. Nevertheless…there will always be a degree of subjectivity in what is done, for example, to agree the level thresholds, and in the judgements of markers when marking questions.”

Rose then highlighted how, in discussions about where to set the level threshold, or pass mark, needed to achieve level 3 in English in 1999, there was a five-mark gap between where marking experts believed it should be set and where statistical analysis suggested. Eventually, human judgement prevailed. At level 4, the two suggested pass marks differed by one mark.

Rose concluded: “Where such small margins are involved, it becomes obvious that testing is not an exact science. The justification for choosing one ‘pass mark’ over another can be barely discernible.”

Yet precisely where the pass mark is set can have a large impact on the percentage of pupils achieving any given level. When I analysed data from 2005, I found that a move of two marks in the level 4 threshold would produce a swing of around three percentage points in the proportion reaching that benchmark.
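The arithmetic behind this is easy to see with a small sketch. The simulation below uses a purely hypothetical bell-shaped distribution of raw marks – not real national test data, and not the actual 2005 figures – to show why, when most pupils’ marks cluster near the threshold, nudging the pass mark by a mark or two shifts the headline percentage by several points.

```python
# Illustrative only: a hypothetical distribution of raw marks out of 100,
# not real national test data. The parameters (mean 55, spread 15) are
# assumptions chosen to put plenty of pupils near the threshold.
import random

random.seed(42)

# Simulate 100,000 pupils' raw marks, clamped to the 0-100 range.
marks = [min(100, max(0, round(random.gauss(55, 15)))) for _ in range(100_000)]

def pass_rate(threshold):
    """Proportion of pupils at or above the given pass mark."""
    return sum(m >= threshold for m in marks) / len(marks)

# Each one-mark move in the threshold shifts the headline figure
# by a couple of percentage points in this toy distribution.
for t in (48, 49, 50, 51, 52):
    print(f"pass mark {t}: {pass_rate(t):.1%}")
```

The exact sizes of the swings depend entirely on the assumed distribution, but the qualitative point holds for any test where marks bunch near the cut score: the headline percentage is acutely sensitive to where the line is drawn.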

A report on level-setting for key stage 3 science in 2007, conducted by the OCR exam board for the Qualifications and Curriculum Authority and quoted in a chapter by Paul Newton of Cambridge Assessment in a recently published book, is similarly interesting. It shows how the proportion of pupils gaining a particular level could have varied by up to five percentage points, according to whether the pass mark was set at the lower or the higher end of a confidence interval suggested by statistical modelling.

In other words, as Rose concluded, this is an imprecise science. Attempting to have a national debate around a rise or fall of one percentage point is highly unwise.

It may be that the one percentage point fall represents a drop in national standards. Or it may just be down to the unavoidable imprecision of the level-setting process. We cannot know for sure. And this need not prompt a flinging up of hands and a demand that testing become more accurate, in my view*, but rather a reconsideration of the weight being placed on test data, not least in judging the effectiveness of the Government’s education policies.

And all this before one even gets into whether the tests themselves – painstakingly constructed though they undoubtedly are – represent good measures of the overall quality of our schools, or just of the ability of primaries to cram pupils for what is, after all, only one of the qualities one might want from education: the ability to perform in a series of time-limited tests in three subjects.

And the fact that the figures overall represent – if taken at face value – a quite startling improvement since the mid-1990s, one that has been maintained in recent years, tends to be underplayed. If 80 per cent of pupils are now reaching the standard that used to be attained by the “average” child, is this truly a cause for national hand-wringing? Moreover, achieving level 3 or 4 has never marked the divide between being able to read, write or do maths and not being able to, as reports, often encouraged by the Government, suggest. Of course, achieving level 4 comes down to whether or not one has accumulated enough marks in a particular test on a particular day, with a single mark sometimes making the difference between level 4 (ie the pupil “can read”) and level 3 (“they cannot”, is the implication). The cut-off is simply not robust enough to support the current high-stakes use of these indicators.

Overall, if this is how the Government is held accountable, it is a very superficial form of accountability, which does little either to enhance public understanding of what is going on in schools, or to promote genuine improvements.

* Although I would be interested to hear if another testing mechanism could be created which would allow one to be more certain that a one percentage point fall in the numbers achieving the threshold represented a genuine change in the underlying standards of performance.
