League table research: some factual mistakes and overdone conclusions

Tuesday, November 2nd

My instinctive reaction, on learning of research published today by Bristol University’s Centre for Market and Public Organisation showing GCSE results have improved faster in England than in Wales in recent years, was “surprise, surprise”. And what, exactly, does this prove?

Take two school systems.

Imagine that, in system A, huge effort and political attention are devoted to improving schools’ results on a few centrally-designed performance indicators. League tables are published which centre on those indicators, inspection judgements centre on them, schools which do badly are repeatedly threatened with closure, heads at such schools are very closely monitored by their local authorities, and numerous other effects, from performance-related pay to rewards for heads of improving schools, accentuate the focus on these results numbers.

In system B, the mechanism for focusing on these indicators is not quite so developed. One change in recent years has been the decision by those responsible for overseeing this system not to publish this performance indicator information in school league tables, collated centrally. System B in general, though, also seems to be less overwhelmingly orientated towards pushing schools to improve on a particular metric of school performance, as measured by exam results.

Suppose, also, that academics come along and look at how the two systems have performed, as measured by results performance indicators, and find that results in system A have risen faster than those in system B, on these metrics. This, the research would seem to imply, shows that system A has “improved” faster, in a general sense, than system B. Moreover, the research also claims that one difference between the two systems, the lack of league tables in system B, explains that “performance” gap. It even goes further, arguing that the very fact of the removal of the league tables from system B has actually reduced pupils’ exam performance in this system, even though results had actually improved there, too, though not as fast as in system A.

Such an analysis would, of course, be highly problematic. It is hardly surprising that, if you make one indicator of performance the be-all and end-all for schools, as has happened in England in recent years with the proportion of pupils achieving five A*-C grades at GCSE (later changed to include a focus on English and maths because of concerns about schools “gaming” the system to focus on “easier” courses), they will focus very closely on improving on that particular measure.

But, as many, including the economist Charles Goodhart (of “Goodhart’s Law” fame), have suggested, the key question is whether this signifies true underlying improvement, or just that, when you implicitly threaten people with losing their jobs if particular statistics do not rise, they are likely to rise. Research which shows only that the measures by which people are being judged have risen, then, really shows very little other than that people, when effectively forced to do so, can focus on what is being measured.

The Bristol study does indeed take improvements in the GCSE results statistics in England (system A), which has had league tables for many years, and compare them with those of Wales (system B) since the Principality dispensed with them for secondary schools in 2001. It argues that, because results in England have risen faster than in Wales, league tables work, saying: “We find systematic, significant and robust evidence that abolishing school league tables markedly reduced school effectiveness in Wales.” Further than this, the press release I have received today in association with the research says the study shows “naming and shaming schools works”.

But the Bristol research, which covers the period 2000 to 2008, was not quite so simplistic as to argue that, simply because the results on England’s published GCSE indicators have gone up faster in recent years than those in Wales, schools here (I am based in England) must have improved, in a general sense, at a faster rate than in Wales.

It does acknowledge the problem alluded to above: if you want to prove this in the way the academics want to, you must have an alternative set of indicators, against which schools were not being judged publicly over the period, or else all your research will show is an increased focus in England on the measurement mechanism itself. (Note 1)

When I read further into the paper to learn it was using a second set of measures to try to show this, some of my fears about it were assuaged. This would be a more sophisticated analysis, I thought.

Unfortunately, there are problems with this second indicator, too. It appears the research team have used for this second measure a “value-added” metric of school performance (Note 2), which looks at each pupil’s key stage 3 results, and then compares them to the pupil’s average GCSE points score two years later. By this measure, it is possible to arrive at figures for the performance of schools in England, and to compare them to those in Wales.

The paper implies that, because this second measure is not “published” in league tables, schools will not focus on it to the same degree as they did for the five A*-C measure at GCSE, and so it provides a cross-check as to whether the improvements of one system (England’s) have come simply through an obsessive focus on one particular indicator.

Although the team is right to look for this cross-check, there seem to me to be factual errors in what they say about the second indicator which they use. There are two problems, one less important and one major.

To take the less significant one first, it is not true to say, as the research team seems to be saying, that average point score indicators have not been published under England’s system. As league table information I have looked at going back to the year 2000 (or before) shows, average GCSE points scores for individual schools have been published, for parents to look at if they want to, since at least that time (see, for example, government information on what went in the tables in 2001 here, under “how the results are reported”). The researchers might contend that value-added information of the precise sort they use (ie measuring KS3 to GCSE results in the way described above) has not been published, but a glance at information published on the education department’s performance tables website would suggest that it has been, at least for part of this period. In any case, if the main “outcome” or end measure of this indicator is which average points score each school ends up with, then that has been published, on a school-by-school basis, in English league tables for a long time. So this might have influenced schools’ actions in terms of trying to boost results on this indicator, making it not very good as an alternative measure of underlying education quality.

I say this is a minor point, though, because I would agree with the paper’s implication that the main statistical influence on schools’ actions in England over the period under consideration has been the five A*-C indicator. It is the published indicator that they have been focused on, as any reader of this site, or anyone familiar with secondary schools in England, will be aware. (Although, strangely, the research team does not acknowledge that the five A*-C indicator was itself superseded, in England’s league table system, by five A*-Cs including English and maths in 2007, a year before the final data which seems to have been used in this study.)

The major issue is what looks to me like a factual howler on pages 18 and 19 of the report. This considers whether England’s GCSE results could have been boosted by the use of non-GCSE “equivalent” qualifications in its official calculations. The paper concedes that non-GCSE courses have been included in its “GCSE” calculations (because these are based on official government figures which include them). It says that “In addition to the regular GCSE qualifications, schools in both England and Wales have used GCSE-equivalent qualifications, typically more vocational qualifications and more frequently used for less academically able children”. It adds, though, that this is a relatively recent phenomenon, saying “the new equivalent qualifications were introduced in England in 2005 and Wales in 2007 as part of the restructuring of the curriculum between the ages of 14 and 19”.

It then uses this fact as one half of the explanation why GCSE-equivalent qualifications do not explain England’s more-improved performance, pointing out that there is no disjunction in performance trends before and after 2005, presumably as might be expected with the introduction of such qualifications into performance measures.

The claim that “equivalent” qualifications were introduced in 2005 is wrong. “Equivalent” qualifications have been around far longer than that, in the form of General National Vocational Qualifications (GNVQs). I remember investigating how some English schools boosted their results, understandably, perhaps, given the pressures on them, through GCSE-equivalent Intermediate GNVQs, since at least 2003. Here is the link to an article a former colleague of mine at the TES wrote back in 2001. Moreover, a glance again at the Government’s rules for calculating schools’ average GCSE points scores (see, again, here) from, say, 2001, shows that GNVQs were being counted, for performance information purposes, as “worth” up to four GCSEs back then. That is, they appear to have featured in the indicators used for this research throughout the period, not just for part of it.

The reason the researchers suggest this phenomenon only started in 2005 can be traced to a footnote in their paper, giving a link in which the Qualifications and Curriculum Authority spoke of “turning the government’s vision for reform of 14-19 learning into a range of effective qualifications, programmes and initiatives”. It is true that the range of qualifications which counted in league tables was expanded in 2005, alongside these reforms. But the GNVQ route was already well established by then as a widely known way in which some schools sought to transform their league table standings through an equivalent qualification, worth multiple GCSEs.

The other half of the paper’s explanation as to why any greater reliance of English schools on “alternative” or “GCSE-equivalent” courses to drive up their results is insignificant as a factor behind differential results patterns in the two countries is that, statistically, GCSE-equivalent courses do not amount to much in the figures over this period, accounting for only 9.5 per cent of total scores among the lowest-performing schools, and 4.3 per cent among median-performing schools in 2006. I have no way of checking that statistic, of course, though I do wonder how significant that effect of nearly 10 per cent on low-performing schools is, given that another conclusion of the paper is that schools with generally low-attaining pupils in Wales have been particularly hard hit by the decision there to abolish league tables (ie their performance has suffered, in a general sense, compared to England because league tables are no longer published).

Moreover, GCSE “equivalent” courses, as demonstrated several times in recent years (see, for example, here), have certainly had a big impact on some schools’ published league table results, including those towards the bottom of the rankings which have been subjected to just the “naming and shaming” approach the paper ends up advocating.

The underlying point is that, while this second indicator is meant to provide a cross-check ruling out the contention that schools in England have been artificially (understandably, in my view, given the pressures on them) boosting results by concentrating on narrow performance metrics, in fact some tactics used by schools to do just this would seem to me to have had the effect of boosting performance on both measures. They are, then, not really independent variables at all.

Examples of tactics which could boost both measures would include the use of these non-GCSE qualifications, as discussed above – which would boost both the main five or more A*-C measure and average pupil points scores – and teaching geared very precisely to the contents of particular exams. A better alternative measure might be something which was much more divorced from the centrally published performance indicator, such as performance in other tests than GCSEs, although even this might reflect in part a greater stress on the importance of test performance – as opposed to other aspects of education – in one system’s schools as compared to those of another. But no other measure is used in the paper.

Pulling back from this somewhat now, there is the problem that the paper takes two countries and then notes that one policy – secondary league table publication – changed in one of those countries and then seeks to assign causation for changes in the exam results of these two countries to that change.

This is a bold attempt, and it is understandable that academics should be looking for the “natural experiment” that the paper suggests is created by parallel policy developments in England and Wales. They point out that previous attempts to investigate the effectiveness of school accountability have been hampered by the lack of a “control group”: a set of schools which have not had certain accountability mechanisms, to be considered alongside those that have.

The academics have looked at Wales, with the cessation of secondary league tables, compared this to what has happened in England and thought that this will make for a good experiment, with one aspect of school accountability present in one country, but not in another.

The trouble is that it’s a stretch too far. For while league tables did indeed change in Wales, they are only one element of what look like different approaches to school accountability. Many other aspects of policy have also changed between the two countries over the period.

The academics, of course, do consider, in some detail to be fair, whether more than one variable in this “experiment” might be changing. While noting that Wales’s schools have tended to be less well funded than England’s, for example, they say they are able to control for this as a factor, effectively saying, if I have understood them correctly, that they can show that English schools have improved more, under their indicators, than those in Wales, even after taking into account possible effects of funding differentials. Fair enough.

They also consider whether inspection regimes might have changed in one country compared to the other over the period. But this is dismissed, the paper saying that “during the period of our study there were no major differences between the England and Wales inspection regimes”, seemingly based on a single study in 2008 by the academic David Reynolds.

My hypothesis is that English inspection systems, which have become very data-focused in recent years, might be encouraging schools here to obsess about their results to a greater degree than those in Wales. This, then, would be an alternative or additional explanation alongside league tables in explaining differences in results improvements between the two countries. To be fair, I cannot be sure, as my knowledge of the Welsh inspection system is sketchy, certainly when compared to what I know about England’s.

But, again, this is not discussed in detail in the paper as a possible alternative explanation. I find it strange that, in this section, there is no mention that the English inspection system actually changed dramatically during the period of the study, from one that was based on longer visits to schools in which classroom observation was important, to one, introduced in 2005, in which inspections focused relentlessly on test and result data. For references to this, see articles here and here, and chapter 16 of my book. Again, it is creditable of the researchers to acknowledge this as a potential extra factor, but why no detail?

The truth is that league tables are only one element – admittedly a very high profile and always controversial element – of the accountability system which has worked to push schools to prioritise results improvements in England in recent years.

Other changes could also have driven the results improvements in England – such as the increasing availability of exam board endorsed textbooks which facilitate very close teaching to the exam, and other support from the boards such as training for teachers in what examiners are looking for; the use of direct threats of intervention from Government and local authorities to schools which are underperforming (for an example of how this pressure was stepped up in the years covering the study, see here); and the advent of School Improvement Partners, who as far as I can work out have been going around English schools emphasising to school leaders, in recent years, just how important it is that they improve their pupils’ exam results.

Have such changes been taking place in Welsh schools to exactly the same degree, so that we can be confident that league tables can be isolated as the main reason for any improvement in results which follows? Well, I would like to see convincing evidence. (Note 3)

Pulling even further back from this, I suspect that it just is not true that all that has changed in a meaningful way between England and Wales over the intervening years – looking outside of accountability itself – has been the publication of secondary league tables. Changes such as the introduction of the national literacy strategy in English schools are considered in the paper and discounted as an alternative explanatory factor, but policy change in England goes beyond this. It was completely hyper-active over the period, so isolating the effects of one particular policy is a perilous game.

Right, that’s my analysis done of this paper’s evidence and arguments. But probably my main concern with it is how such research – premised as I believe it is on ultimately thin grounds – is used. It will be used to promote a view that league tables “work”, on the basis of at best partial statistical analysis and, really, given the other evidence which is available, not much else – and even that tough policies such as “naming and shaming” schools are right.

It also, obviously, promotes a view, inadvertently or not, that all that matters in education is exam results. Do the public really share this opinion that exam results are all? Of course, many will think they are important. But if there has been a national debate about this, and the public have concluded that good exam grades are all that matters, I have missed it. It is a tremendous shame if those who view grades as the best, most “precise”, and most convenient way of measuring schools help, inadvertently or not, to encourage this opinion for that reason.

The paper does admit, in one paragraph on page 22, that “Of course, teachers and schools may have broader aims than GCSE exam results for their pupils. These are not measured in our data so we can say nothing about the potential impact on (ceasing) publishing performance information on these broader educational outcomes, nor on any potential impact on teacher and headteacher motivation and morale”.

I would submit – and it pains me to say this as I have had some contact in the past with two of the research team and have respected the expertise behind their work, but I am going to say it anyway – that this is just not good enough from serious researchers. This is especially the case, given the boldness of the claims being made on the basis of this statistical analysis: that league table reform in Wales has reduced the effectiveness of schools there, and that naming and shaming schools “works”. Not “reduced schools’ effectiveness according to particular measures of pupil exam performance”, or “works in terms of driving up performance on exam indicators”, but “reduced schools’ effectiveness” and “works” full stop.

(It could be argued, here, that “effectiveness”, when used by statisticians to describe school performance, has been defined and used in quite a precise way to mean “effectiveness as measured by particular results indicators”. But this is not how it is interpreted by those outside of the statistical community reading this report).

In essence, the position is: “there may be other things to consider than exam results, but we haven’t bothered doing so in any depth”, possibly because these qualities cannot easily be measured statistically, although this is not specified in the paper and would be contentious. It seems no further explanation is needed as to why such an investigation was not at least a part of the analysis. And I say that despite the researchers clearly having gone to the trouble of investigating potential caveats and limitations within their research.

For all the copious statistical expertise clearly involved in this analysis, research which essentially only considers performance statistically in this way cannot provide the whole picture of what has been going on in schools in the two countries.

It is providing at best a part of the picture of what has been going on. The other part could only be provided through detailed qualitative research, which you have to say is lacking here. If English schools have improved, how have they done it, for example? Do teachers simply not bother to teach effectively without league tables, do league tables make them work harder, or what? Have there been any side-effects, for pupils, parents and teachers? (I can point, of course, to evidence of many, including subtle effects such as claims that certain “academic” subjects such as languages can struggle to compete for pupils under accountability pressures facing schools to maximise published scores.) There is no analysis here, with the numbers supposed, in effect, to serve as sufficient evidence in themselves of general improvement. Yet there are powerful examples around of a rich qualitative approach to finding out what has really been going on in English education in recent years, the best of which, I think, has been the hugely detailed Cambridge Primary Review.

It seems to me there is a division, in education research, between academics who trained as economists, and therefore are trained to be fluent with numbers and multi-level modelling, and those focusing more on the qualitative.

Because this research is very much of the former variety, with lots of formulae, it will probably be taken seriously. Numbers, even without, I believe, their full context, are very powerful, and can be assumed to be showing the “truth” about a particular phenomenon. They are also, I believe, seductive, in offering relatively easy-to-collect evidence on school “effectiveness”, without necessitating hugely detailed qualitative study. They are not “fuzzy” in a qualitative sense. And many people whose eyes glaze over at the sight of a complicated results equation will simply take it on trust that the economist-education academic is reaching the right conclusions.

The truth, of course, is usually more complicated than can be demonstrated through formulae alone. My worry is that these findings will be used to push forward a particular view of education reform, implying in the minds of reformers that getting schools to focus relentlessly on exam success – however measured – is a good thing, that there are no negative impacts on what goes on in schools arising from centrally collated Government performance tables, and that hard-hitting policies which many will argue end up ultimately undermining public education, such as “naming and shaming”, are the way to go.

This report does not demonstrate the effectiveness of these policies in any general sense. It certainly seems to me to be very far from proof that “league tables work”. But its findings will be taken in just such a way. I think education deserves better.

Note 1: This, of course, has one obvious problem in that schools can focus on improvements on the published indicators through concentrating, for example, on pupils just on the borderline of achieving the particular indicator, at the expense of other learners. More generally, measures could be taken to “game” the system such that schools improve on the measured indicator, but not in any fundamental or wider sense.

Note 2: Actually, it is not completely clear to me that this is the alternative measure being used, after reading the paper, or whether the team did not use a value-added measure at all here, simply comparing pupil performance as measured by “raw” average GCSE points scores. But whichever measure is being used does not change the argument above, because average points scores did feature in published league tables during this period for individual schools.

Note 3: It could be argued, of course, that the differential approaches to improving performance in English and Welsh schools really do not matter; all that really matters is that England has somehow managed to generate faster-improving scores than Wales. These results are really important to young people, so essentially it does not matter how they have been generated, or what they signify. I don’t buy this argument, I’m afraid, since GCSE results are a relative, not an absolute, good to young people. Pupils in England – most of whom will be competing against other pupils from English schools in the employment and higher education market after they finish school – will only benefit from a national improvement in results if these truly signify underlying improvements in the quality of what they understand, as supposedly measured by the results, since pupil A does not benefit if he or she then finds that, though his or her results have improved, so have those of pupil B, against whom he or she is competing in the employment or higher education market, because of a national results improvement. That is, results are useful to a pupil in themselves in a relative sense – ie to the degree they confer advantage over competitors – rather than in an absolute sense: pupils do not gain, relative to others, if results for all pupils improve.

4 Comments

The Daily Digest(ive) November 3rd 2010 | Creative Education Blog

10 years ago

I have included your post in my Daily Digest of educational blog posts as I thought it would be of interest to other teachers. You can see it here: http://bit.ly/a345eQ

@creativeedu

10 years ago


A thorough and appropriate consideration of the case. A shame that the original researchers didn’t manage that!

Tafkam

10 years ago


“Systemic failure” in Welsh Education « CMPO Viewpoint

10 years ago
