Friday, February 17, 2012

Among the Many Things Wrong With International Achievement Comparisons


Gene V Glass

The Brown Center on Education Policy of the Brookings Institution released a report just a couple of days ago with the jaw-breaking title “HOW WELL ARE AMERICAN STUDENTS LEARNING? With sections on predicting the effect of the Common Core State Standards, achievement gaps on the two NAEP tests, and misinterpreting international test scores.” (See for yourself.) The report is penned by Tom Loveless, a researcher from whom one might have expected nothing but analysis aimed at proving the abject failure of American public education. In this instance, he has disappointed any followers seeking such a message, and has in fact produced a fairly balanced analysis of the many pitfalls in basing policy on numbers.

My attention was drawn to the section on “misinterpreting international test scores,” since I have long felt that these international assessments are a mess of uninterpretable numbers providing a full-employment program for psychometricians, statisticians, and journalists. Loveless took a close look at PISA (the Programme for International Student Assessment). He concluded that policy makers, educators, journalists, and the public in general often arrive at “dubious conclusions of causality” based on the results of such assessments. Of course. Just recall the grand exodus to Japan in the 1980s, when our nation was discovered to be “at risk” and the Japanese economy was booming, just before that economy tanked. And what did the emissaries discover as the secret to educational excellence? Jukus (privately operated “cram” businesses), high suicide rates among young people subjected to immense high-stakes pressure, and an economy about ready to go into the dumpster. Now all eyes are on Finland. The whole scene is reminiscent of the IRA (International Reading Assessment of the 1970s) that showed that the top nation in the world in reading was … are you ready? … Italy. Italy?? That one sent people to Rome for a few months, until it was discovered that nowhere had the attempt at random sampling been so badly compromised as in Italy.

Loveless also pointed out that the numbers (averages of a nationwide sample of students) on which rankings of nations are based are frequently so close that there is no “statistical significance” to the differences. True, but let’s ignore this problem so as not to be diverted into an alley of dry mathematical mumbo jumbo.

But wait a minute. There is something far more wrong with these international assessments and comparisons than anyone seems interested in talking about. Think! A reading test that compares students in dozens of countries. The obvious question is “In what language is the test written?” And the obvious answer is “In the language of that nation.” But who is drawing the obvious conclusion? How in heaven’s name can you construct a reading test in dozens of different languages (English, Hungarian, Norwegian, and yes, Finnish) and be confident that the test is equally difficult in all of these languages? Well, the answer is that you can’t. It should be perfectly obvious to anyone who thinks about it for more than five minutes that it is impossible. And all the ministrations and obfuscations of the companies and consultants who make or supplement their living off of such stuff do not change that fact.

Let’s take a look at some of these results. I have excerpted some data from the 2003 PISA Reading test for 15-year-olds. They are merely illustrative, and it’s of no consequence that this is a small subset of the complete results.

2003 PISA Reading, 15-year-olds
Finland 543
Canada 528
Liechtenstein 525
Sweden 514
Hong Kong 510
Norway 500
Japan 498
Poland 497
France 496
USA 495
Germany 491
Austria 491
Hungary 482
Spain 481
Italy 476

So there is the USA, a point or two or three below France and Poland and Japan and whoever, and a point or two above Germany and Austria. This is the kind of statistical insignificance that Loveless was talking about. But even taking seriously a difference like the 19 points between the US and Sweden ignores the question before us: How do you write a reading test in English, translate it into Swedish (or vice versa), and end up confident that one version is not intrinsically more difficult than the other? I insist that the answer to that question is that you can’t. And to claim that one has done so merely sweeps under the rug a host of concerns that include grammatical structure, syntax, and familiarity of vocabulary, not to mention the culture of the students taking the test.
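Loveless’s point about statistical insignificance can be made concrete with a quick back-of-the-envelope calculation. The sketch below is my own illustration, not anything from the Brookings report: the standard errors of 3 points per national mean are assumed round numbers (PISA publishes the actual ones alongside the scores), and the function name is mine.

```python
import math

def mean_diff_significant(mean_a, mean_b, se_a, se_b):
    """Two-sample z-test for the difference between two national means.

    se_a and se_b are the standard errors of each national average;
    the values of ~3 points used below are assumptions for illustration,
    not figures quoted in the post.
    """
    diff = mean_a - mean_b
    se_diff = math.sqrt(se_a**2 + se_b**2)  # SE of the difference
    z = diff / se_diff
    return diff, z, abs(z) > 1.96  # 1.96 = the usual 5% two-sided criterion

# France (496) vs. USA (495): a 1-point gap, assumed SEs of 3 points each
print(mean_diff_significant(496, 495, 3.0, 3.0))
# Sweden (514) vs. USA (495): the 19-point gap, same assumed SEs
print(mean_diff_significant(514, 495, 3.0, 3.0))
```

Under these assumptions the 1-point France–USA gap is nowhere near significant, while the 19-point Sweden–USA gap easily clears the bar, which is exactly why the larger gaps are the ones worth interrogating for translation artifacts.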

Now the keepers of the PISA tests have produced a lengthy Appendix, almost 40 pages, to a report in which they claim to have solved the problem of producing equivalent translations by the assiduous application of the finest psychometric theories. (I don’t believe it. Forget about DIF analysis, i.e., Differential Item Functioning, which only tosses out the few items that show really large differences in difficulty between two forms and ignores consistent though small differences.) What the PISA technical manual omits is any example of a reading test item in two or three different languages, so that we might scrutinize the results of all this fine theory. (Ironically, one keeper of the items declined to release a few examples to me, even though that person was my own doctoral student some 30 years ago.)
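To see why DIF screening can miss the problem, here is a toy Monte Carlo of my own devising (not anything from the PISA manual; all parameters are invented for illustration): every item in the “translated” form is made one percentage point harder, a handicap far too small for item-by-item screening to flag, and yet the raw-score gap that accumulates is about what the arithmetic predicts (40 items × 0.01 ≈ 0.4 items).

```python
import random

def simulated_gap(n_items=40, per_item_bias=0.01, n_students=5000, seed=1):
    """Toy illustration (mine, not PISA's method): every item in the
    translated form is per_item_bias harder -- a difference too small
    for DIF screening to flag -- and we compare mean raw scores on the
    two forms across simulated student samples."""
    rng = random.Random(seed)

    def mean_raw_score(handicap):
        # each student attempts n_items items; base success rate 0.60
        total = sum(
            sum(rng.random() < (0.60 - handicap) for _ in range(n_items))
            for _ in range(n_students)
        )
        return total / n_students

    return mean_raw_score(0.0) - mean_raw_score(per_item_bias)

# Expected gap: n_items * per_item_bias = 0.4 raw-score items. Once raw
# scores are rescaled to a reporting metric like PISA's (mean 500,
# sd 100), a fraction of an item can correspond to a gap of several
# scale points -- the size of many of the country-to-country gaps above.
print(round(simulated_gap(), 2))
```

The point of the toy: per-item differences of this size are invisible to the usual screens, yet in aggregate they are on the order of the differences that separate mid-table countries.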

So let’s look at an example of our own: Tom Sawyer confronting the fence he has been set to whitewash.

Tom appeared on the sidewalk with a bucket of whitewash and a long-handled brush. He surveyed the fence, and all gladness left him and a deep melancholy settled down upon his spirit. Thirty yards of board fence nine feet high. Life to him seemed hollow, and existence but a burden.
And now, a translation into German (since I majored in German as an undergrad some 50+ years ago):
Tom ist auf dem Bürgersteig mit einem Eimer von Tünche und einer langbehandelten Bürste erscheinen. Er hat den Zaun vermessen, und alle Freude hat ihn verlassen und eine tiefe Melancholie aud sein Geist geberuhigt wird. Dreißig Höfe von Ausschusszaun neun Füße hoch. Leben, zu dem ihn Hohlraum gescheinen hat, und Existenz aber eine Last.
Now consider the many difficult choices a translator must make in rendering the English into German, each of which could affect a student’s ability to comprehend a sentence, a phrase, or the entire passage. Just a few: Is Tünche, the German for “whitewash,” a relatively obscure word? Is “whitewash” equally obscure in American English and in Canadian English (Canadians being prone to speak of Scotch tape as “cello”)? Melancholie might just as well be translated as Traurigkeit, depending on local preferences for Latinate vs. Germanic roots, preferences that remain strong in certain locales; etc. And what should we do with the 30-yard-long fence? Translate it as a fence that is 27.432 meters long?

Bottom line: I believe that the differences in difficulty produced by the vagaries of translating a reading test across several languages are at least as large as many of the differences among average PISA Reading test scores, the latter differences being the stuff of media accounts as well as learned papers on school reform.

To bolster my belief, along comes an actual piece of education research addressed precisely to the translation question in international reading comparisons. A recent article by Inga Arffman in the Scandinavian Journal of Educational Research carries the title “Equivalence of translations in international reading literacy studies” (Vol. 54, No. 1, 37-59). The paper summarizes a study that examined the problems encountered in translating texts in international reading assessments. And despite the fact that Arffman is a faculty member of the University of Jyväskylä in Finland, which has every motive possible to believe that the PISA Reading assessments are the most valid tests in the history of psychometrics, the conclusion of the research is that “[it] will probably never be possible to attain full equivalence of difficulty in international reading literacy studies….” Amen.


  1. I just wanted to point out that the German translation is horrible (I'm a native speaker). PISA scholars should be able to answer the question of equivalent translations: I'd agree with that and join in with the criticism if they can't. But I find it difficult to see how any home-made translation would succeed in undermining trust in PISA results. Yes, it is difficult to convey the same meaning in a different language, especially when, for reasons of comparability, lexicon and passive vocabulary are essential. Deeming this outright impossible, however, is not helpful. There's a myriad of translations out there and we know that at least some of them are very, very good, be it James Joyce or some children's book. (My English may not be flawless either, sorry for that. But the Tom Sawyer translation is virtually incomprehensible.)

  2. It is not a matter of the fidelity of a translation. It is a matter of producing psychometric equivalence right down to percentage points of difficulty between two items. Even small differences in difficulty between the two language versions of an item, accumulated across many items, could produce differences between two nations of the magnitude observed for many of the nations in these international rankings. To place one's trust in the PISA scholars having solved a problem so fraught with complexities as equalizing the cognitive load in two different languages (Finnish and Hungarian?) strikes me as naive.