Wednesday, May 8, 2024

Class Notes: Relationship of Education Policy to Education Research and Social Science

2005

Notes to the Proseminar: Relationship of
Education Policy to Education Research and Social Science

These notes have two purposes: to disabuse naïve conceptions of the role of research in policy development (if you happen to have any, and perhaps you don’t); and to illustrate and conceptualize the actual relationship of social science and educational research to policy.

Since we are working in a graduate program that has training for a career of scholarship and research as its goal, and since it is education policy that is the subject of this scholarly attention, it should not be unexpected that at some point we would ask how these two things come together. There is much more to policy making than doing or finding research that tells one how to make policy—far more. In fact, basing policy on research results—or appearing to do so—is a relatively recent phenomenon. Policies are formulated and adopted for many more reasons than what educational and social science research give. Policies may arise from the mere continuation of traditional ways of doing things; they may come about to satisfy a group of constituents to whom one owes one’s elected or appointed position. They may grow out of the mind of an authority in a way that is completely incomprehensible, even to the leader. But increasingly, policies are being viewed by opinion makers, journalists, intellectuals of various sorts, and even a broader public as not legitimate or not deserving of respect or lacking authority unless they are linked somehow to science. In what follows immediately, I have written down a few ideas on this subject. They may orient your thinking in a way that makes the articles you will read more meaningful. It is also my hope that they will help us understand what went on in the exercise last week when we carried out a simulation of how a policy is influenced by research.

In brief, I believe that research is far too limited in its scope of application to dictate policy formulation. Policy options arise in large part from the self interests of particular groups. No matter how broad and solid the research base, opponents of a policy recommendation that seems to arise from a particular body of research evidence will find ground on which to stand where the evidence is shaky or non-existent, and from there they will oppose the recommendation. If “whole language” instruction appears from all available research to result in greater interest in reading and more elective reading in adulthood, opponents will argue that children so taught will be disadvantaged in learning to read a second language or that no solid measures of comprehension were taken thirty years later in adulthood.

So what is the relationship of research to policy? The answer entails distinctions between such things as codified policy (laws, rules & regulations, by-laws and the like) and policy-in-practice (regularities in the behavior of people and organizations that rise to a particular level of importance—usually because they entail some conflict of interests or values—where we care to talk about them as “policy”).

Policy-in-practice is often shaped by research in ways that are characterized by the following:

  • Long lag times between the “scientific discovery” and its effect on practice;
  • Transmission of “scientific truths” through popular media, folk knowledge, and personal contact;
  • Largely unconscious or unacknowledged acceptance of such truths in the forms of common sense and widely shared metaphors.

Thus do we move from views of children that prevailed in the 19th century—tabula rasa, sinful, responding to corporal punishment, having “mental muscles” needing exercise—to views that prevail in the 21st century—learning through rewards either by operant (Skinnerian) principles or by need reduction, having fragile “self-concepts” that must be nurtured to produce high “self-esteem,” etc.—thanks to Thorndike, Freud, Skinner and the rest. (It may be worth remarking upon at this point that research powerful enough to change prevailing views is rare, resembles what Kuhn has called a “paradigm shift”—although Kuhn quite clearly observed that the social sciences are in a “pre-paradigm state” so that there could be no paradigm shifts in the sense he spoke of them—and is usually rejected as wrong when first presented precisely because it challenges prevailing views. I hasten to add, however, that the vast majority of ideas that challenge the prevailing view are in fact wrong.)

With respect to codified policy, research serves the function of legitimating choices, which it can do on account of its respected place in modern affairs. Research is used rhetorically in policy debates. Not to advance research in support of one’s position is often tantamount to conceding the debate—as if the prosecution put its expert on the stand and the defense had no expert. In rare instances, where matters are so recondite as to lie beyond the public’s understanding, where a small group of individuals exercises strict control (“contexts of command,” as Cronbach and his associates spoke of them), or where the stakes are so low that almost no one cares about them, research may actually determine policy. But even these contexts are surprisingly and increasingly rare. (Research evidence is marshaled against such seemingly “scientific” practices as vaccinating against flu viruses, preventing forest fires, the use of antibiotics, tonsillectomies, and cutting fat out of your diet.)

For the most part, research in education functions in a “context of accommodation” where interests conflict and the stakes are relatively high. There, research virtually never determines policy. Instead, it is used in the adversarial political process to advance one’s cause. Ultimately, the policy will be determined through democratic procedures (direct or representative) because no other way of resolving the policy conflicts works as well. Votes are taken as a way of forcing action and tying off inquiry, which never ends. (As an observer once remarked concerning trials, they end because people get tired of talking; if they were all conducted in writing—as research is—they would never end.)

Research that is advanced in support of codified policy formation is at the other end of the scale from the kind of scientific discoveries earlier referred to that revolutionize prevailing views. Such studies resemble what Kuhn called “normal science”—small investigations that function entirely within the boundaries of well-established knowledge and serve to reinforce prevailing views. Such work is properly regarded less as “scientific” inquiry than as other things: rhetoric, testimony to the importance of a set of ideas or concerns, existence demonstrations (“This is possible.”). [Having written this paragraph in the first draft, now, in the second draft, I have no idea what I was driving at.]

Research does not determine policy in the areas in which we are interested—human services, let’s call them—in large part because such research is open ended, i.e., its concerns are virtually unlimited. Such research lacks a “paradigm” in the strict Kuhnian sense: not in the sense in which the word has come to mean virtually everything and nothing in popular pseudo-intellectual speech, but “paradigm” in the sense that Kuhn used it, meaning an agreed-upon set of concepts, problems, measures and methods. On one of the rare occasions when Kuhn was asked whether educational research had a paradigm, or any recent “paradigm shifts,” he seemed barely to understand the question. In writing (The Structure of Scientific Revolutions), he opined that even psychology was in a “pre-paradigm” state. Without boundaries, a body of research supporting policy A can always be said to be missing elements X, Y and Z, which just happen at the moment to be of critical importance to those who dislike the implications of whatever research has been advanced in support of policy A.

This lack of a “paradigm” for social scientific and educational research not only makes them suffer from an inability to limit the number of considerations that can be claimed to bear on any one problem, but also means that there are no guidelines on what problems the research will address. The questions that research addresses do not come from (are not suggested by) theories or conceptual frameworks themselves, but rather reflect the interests of the persons who choose them. For example, administrators in one school district choose to study “the culture of absenteeism” among teachers; in so doing they ignore the possibility that sabbaticals for teachers might be a worthy and productive topic for research. The politics of this situation are almost too obvious to mention.

What are some functions of research in the process of policy formation?

a) Researchers give testimony in the legal sense in defending a particular position in adversarial proceedings, wherever these occur. (There even exists an online journal named Scientific Testimony, http://www.scientific.org/.) Both sides of a policy debate will parade their experts, giving conflicting testimony based on their research. But not to appear and testify is to give up the game to the opposition.

b) Researchers educate (or at least, influence) decision makers by giving them concepts and ways of thinking that are uncommon outside the circles of social scientists who invent and elaborate them. An example: in the early days of the Elementary and Secondary Education Act of 1965 (the first significant program of federal aid to education), the Congressional hearings for reauthorization of the law were organized around the appearance of various special interest groups: the NEA, the AFT, the vocational education lobby, and so forth. Paul Hill, now an education policy professor at the University of Washington, was a highly placed researcher/scholar in the National Institute of Education in the early 1970s. His primary responsibility was the evaluation of Title I of ESEA, the Compensatory Education portion of the law. Through negotiating and prodding and arguing, he succeeded in having the Congressional hearings for reauthorization organized around a set of topics related to compensatory education that reflected the research and evaluation community’s view of what was important: class size, early childhood developmental concerns, teacher training, and the like. No one can point to a study or a piece of research that affected Congress’s decisions about compensatory education; but for a time, the decision makers talked and thought about the problems in ways similar to how the researchers thought about them.

c) Researchers give testimony (in the “religious” sense of a public acknowledgment or witnessing) to the importance of various ideas or concepts simply by merit of involving them in their investigations. (“Maybe this is important, or else why would all those pinheads be talking about it?”)

Tuesday, August 29, 2023

Still, Done Too Soon

Michael John Scriven (28 March 1928 – 28 August 2023)

Bob Stake hired me at Urbana in 1965. Bob was editing the AERA monograph series on Curriculum Evaluation. He shared a copy of a manuscript he was considering. It was “The Methodology of Evaluation.” Up to that point, evaluation for educators was about nothing much more than behavioral objectives and paper-&-pencil tests. Finally, though, someone was talking sense about something I could get excited about. At that point I had heard only rumors, some true, some not, about the author: he was a philosopher; he was Australian; his parents were wealthy sheep ranchers; he was moving from Indiana to San Francisco; he told a realtor that he wanted an expensive house with only one bedroom.

I didn’t meet Michael in person until about 1968. I had moved to Boulder, and he was attending a board meeting of the Social Sciences Education Consortium. He had helped start SSEC back in Indiana in 1963, and it had moved to Colorado in the meantime. I had nothing to do with SSEC but somehow was invited to the dinner at the Red Lion Inn. I knew Michael would be there, and I was eager to see this person in the flesh. He arrived and a dinner of a dozen or so commenced. As people were seated, Michael began to sing in Latin a portion of some Catholic mass. I had no idea what it was about, but it was clear that he was amused by the reaction of his companions. At one point in the table talk, someone congratulated an economist in attendance on the birth of his 7th child. “A true test of masculinity,” someone loudly remarked. “Hardly, in an age of contraceptives,” said Michael sotto voce. It was 1968 after all.

We next met in 1969. I had the contract from the US Office of Education – it was not a Department yet – to analyze and report the data from the first survey of ESEA Title I, money for the disadvantaged. The contract was large as was my “staff.” I was scared to death. I called in consultants: Bob Stake, Dick Jaeger; but Michael was first. He calmed me down and gave me a plan. I was grateful.

We met again in 1972. It was at AERA in New York. He invited me up to the room to meet someone. It was Mary Anne. She was young; she was extraordinarily beautiful. I was speechless. Those who knew Michael only recently – say, post 1990 – may not have known how handsome and charming he was.

I saw Michael rarely post-1980. His interest in evaluation became his principal focus and my interests wandered elsewhere. One day when I found myself analyzing the results of other people’s analyses, I thought of Michael and “meta-evaluation” (literally the evaluation of evaluations) and decided to call what I was doing meta-analysis. Very recently, I wrote him and told him that he was responsible for the term “meta-analysis.” I was feeling sorry for him; it was the only thing I could think to say that might make him feel a bit better. I probably overestimated.

In the late 1990s, Sandy and I were in San Francisco and Michael invited us to Inverness for lunch. Embedded in memory are a half dozen hummingbird feeders, shellfish salad, and the library – or should I say, both libraries. When the house burned down and virtually everything was lost, I remembered the library. When Michael’s Primary Philosophy was first published in 1966, I bought what turned out to be a first printing. Unknown to Michael and many others, apparently, there was an interesting typo. Each chapter’s first page bore its number and its title, e.g., III ART. However, on page 87, there was only the chapter number IV. The chapter name was missing: GOD. After the house burned down and the libraries were lost, I sent him my copy of Primary Philosophy: "Keep it." He was amused and grateful.

There was a meeting of Stufflebeam’s people in Kalamazoo around 2000 perhaps. Michael was in charge. I was asked to speak. I can barely remember what I said; maybe something about personally and privately held values versus values that are publicly negotiated. I could tell that Michael was not impressed. It hardly mattered. He invited Sandy and me to see his house by a lake. There were traces that his health was not good.

I can’t let go of the notion that there are some things inside each of us that drive us and give us a sense of right-and-wrong and good-better-best that one might as well call personal values. They are almost like Freud’s super-ego, and they are acquired in the same way, by identification with an object (person) loved or feared. I know I have a very personal sense of when I am doing something right or well. A part of that sense is Michael.

Thursday, July 21, 2022

A Memory

Ghosts and Reminiscences:
My Last Day on Earth as a "Quantoid"

Gene V Glass
Arizona State University

I was taught early in my professional career that personal recollections were not proper stuff for academic discourse. The teacher was my graduate adviser Julian Stanley, and the occasion was the 1963 Annual Meeting of the American Educational Research Association. Walter Cook, of the University of Minnesota, had finished delivering his AERA presidential address. Cook had a few things to say about education, but he had used the opportunity to thank a number of personal friends for their contribution to his life and career, including Nate Gage and Nate's wife; he had spoken of family picnics with the Gages and other professional friends. Afterwards, Julian and Ellis Page and a few of us graduate students were huddled in a cocktail party listening to Julian's post mortem of the presidential remarks. He made it clear that such personal reminiscences on such an occasion were out of place, not to be indulged in. The lesson was clear, but I have been unable to desist from indulging my own predilection for personal memories in professional presentations. But that early lesson has not been forgotten. It remains as a tug on conscience from a hidden teacher, a twinge that says "You should not be doing this," whenever I transgress.

 Bob Stake and I and Tom Green and Ralph Tyler (to name only four) come from a tiny quadrilateral no more than 30 miles on any side in Southeastern Nebraska, a fertile crescent (with a strong gradient trailing off to the northeast) that reaches from Adams to Bethany to South Lincoln to Crete, a mesopotamia between the Nemaha and the Blue Rivers that had no more than 100,000 population before WW II. I met Ralph Tyler only once or twice, and both times it was far from Nebraska. Tom Green and I have a relationship conducted entirely by email; we have never met face-to-face. But Bob Stake and I go back a long way. 

 On a warm autumn afternoon in 1960, I was walking across campus at the University of Nebraska headed for Love Library and, as it turned out, walking by chance into my own future. I bumped into Virginia Hubka, a young woman of 19 at the time, with whom I had grown up since the age of 10 or 11. We seldom saw each other on campus. She was an Education major, and I was studying math and German with prospects of becoming a foreign language teacher in a small town in Nebraska. I had been married for two years at that time and felt a chronic need of money that was being met by janitorial work. Ginny told me of a job for a computer programmer that had just been advertised in the Ed Psych Department where she worked part time as a typist. A new faculty member—just two years out of Princeton with a shiny new PhD in Psychometrics—by the name of Bob Stake had received a government grant to do research. 

 I looked up Stake and found a young man scarcely ten years my senior with a remarkably athletic looking body for a professor. He was willing to hire a complete stranger as a computer programmer on his project, though the applicant admitted that he had never seen a computer (few had in those days). The project was a Monte Carlo simulation of sampling distributions of latent roots of the B* matrix in multi-dimensional scaling—which may shock latter-day admirers of Bob's qualitative contributions. Stake was then a confirmed "quantoid" (n., devotee of quantitative methods, statistics geek). I took a workshop and learned to program a Burroughs 205 computer (a competitor of the IBM 650); the 205 took up an entire floor of Nebraska Hall, which had to have special air conditioning installed to accommodate the heat generated by the behemoth. My job was to take randomly generated judgmental data matrices and convert them into a matrix of cosines of angles of separation among vectors representing stimulus objects. It took me six months to create and test the program; on today's equipment, it would require a few hours. Bob took over the resulting matrix and extracted latent roots to be compiled into empirical sampling distributions. 

 The work was in the tradition of metric scaling invented by Thurstone and generalized to the multidimensional case by Richardson and Torgerson and others; it was heady stuff. I was allowed to operate the computer in the middle of the night, bringing it up and shutting it down by myself. Bob found an office for me to share with a couple of graduate students in Ed Psych. I couldn't believe my good luck; from scrubbing floors to programming computers almost overnight. I can recall virtually every detail of those two years I spent working for Bob, first on the MDS project, then on a few other research projects he was conducting (even creating Skinnerian-type programmed instruction for a study of learner activity; my assignment was to program instruction in the Dewey Decimal system). 

 Stake was an attractive and fascinating figure to a young man who had never in his 20 years on earth traveled farther than 100 miles from his birthplace. He drove a Chevy station wagon, dusty rose and silver. He lived on the south side of Lincoln, a universe away from the lower-middle class neighborhoods of my side of town. He had a beautiful wife and two quiet, intense young boys who hung around his office on Saturdays silently playing games with paper and pencil. In the summer of 1961, I was invited to the Stakes' house for a barbecue. Several graduate students were there (Chris Buethe, Jim Beaird, Doug Sjogren). The backyard grass was long and needed mowing; in the middle of the yard was a huge letter "S" carved by a lawn mower. I imagined Bernadine having said once too often, "Bob, would you please mow the backyard?" (Bob's children tell me that he was accustomed to mowing mazes in the yard and inventing games for them that involved playing tag without leaving the paths.) 

 That summer, Bob invited me to drive with him to New York City to attend the ETS Invitational Testing Conference. Bob's mother would go with us. Mrs. Stake was a pillar of the small community, Adams, 25 miles south of Lincoln where Bob was born and raised. She regularly spoke at auxiliary meetings and other occasions about the United Nations, then only 15 years old. The trip to New York would give her a chance to renew her experiences and pick up more literature for her talks. Taking me along as a spare driver on a 3,500 mile car trip may not have been a completely selfless act on Bob's part, but going out of the way to visit the University of Wisconsin so that I could meet Julian Stanley and learn about graduate school definitely was generous. Bob had been corresponding with Julian since the Spring of 1961. The latter had written his colleagues around the country urging them to test promising young students of their acquaintance and send him any information about high scores. In those pre-GRE days, the Miller Analogies Test and the Doppelt Mathematical Reasoning Test were the instruments of choice. Julian was eager to discover young, high scorers and accelerate them through a doctoral program, thus preventing for them his own misfortune of having wasted four of his best years in an ammunition dump in North Africa during WW II—and presaging his later efforts to identify math prodigies in middle school and accelerate them through college. Bob had created his own mental ability test, named with the clever pun QED, the Quantitative Evaluative Device. Bob asked me to take all three tests; I loved taking them. He sent the scores to Julian, and subsequently the stop in Madison was arranged. Bob had made it clear that I should not attend graduate school in Lincoln. 

 We drove out of Lincoln—the professor, the bumpkin and Adams's Ambassador to the U.N.—on October 27, 1961. Our first stop was Platteville, Wisconsin, where we spent the night with Bill Jensen, a former student of Bob's from Nebraska. Throughout the trip we were never far from Bob's former students, who seemed to feel privileged to host his retinue. On day two, we met Julian in Madison and had lunch at the Union beside Lake Mendota with him and Les McLean and Dave Wiley. The company was intimidating; I was certain that I did not fit in and that Lincoln was the only graduate school I was fit for. We spent the third night sleeping in the attic apartment of Jim Beaird, whose dissertation that spring was a piece of the Stake MDS project; he had just started his first academic job at the University of Toledo. The fourth day took us through the Allegheny Mountains in late October; the oak forests were yellow, orange and crimson, so unlike my native savanna. We shared the driving. Bob drove through rural New Jersey searching for the small community where his brother Don lived; he had arranged to drop off his mother there. The maze was negotiated without the aid of road maps or other prostheses; indeed, none was consulted during the entire ten days. That night was spent in Princeton. Fred Kling, a former Psychometric Fellow with Bob at ETS in Princeton, and his wife entertained us with a spaghetti dinner by candlelight. It was the first time in my life I had seen candles on a dinner table other than during a power outage, as it was also the first time I had tasted spaghetti not out of a can.

 The next day we called on Harold Gulliksen at his home. Gulliksen had been Bob's adviser at Princeton. We were greeted by his wife, who showed us to a small room outside his home office. We waited a few minutes while he disengaged from some strenuous mental occupation. Gulliksen swept into the room wearing white shirt and tie; he shook my hand when introduced; he focused on Bob's MDS research. The audience was over within fifteen minutes. I didn't want to return to Princeton. 

 We drove out to the ETS campus. Bob may have been gone for three years, but he was obviously not forgotten. Secretaries in particular seemed happy to see him. Bob was looking for Sam Messick. I was overwhelmed to see that these citations—(Abelson and Messick, 1958)—were actual persons, not like anything I had ever seen in Nebraska of course, but actual living, breathing human beings in whose presence one could remain for several minutes without something disastrous happening. Bob reported briefly on our MDS project to Messick. Sam had a manuscript in front of him on his desk. "Well, it may be beside the point," Messick replied to Bob's description of our findings. He held up the manuscript. It was a pre-publication draft of Roger Shepard's "Analysis of Proximities," which was to revolutionize multidimensional scaling and render our Monte Carlo study obsolete. It was October 30, 1961. It was Bob Stake's last day on earth as a quantoid. 

 The ETS Invitational Testing Conference was held in the Roosevelt Hotel in Manhattan. We bunked with Hans Steffan in East Orange and took the tube to Manhattan. Hans had been another Stake student; he was a native German and I took the opportunity to practice my textbook Deutsch. I will spare the reader a 21-year-old Nebraska boy's impressions of Manhattan, all too shopworn to bear repeating. The Conference was filled with more walking citations: Bob Ebel, Ledyard Tucker, E. F. Lindquist, Ted Cureton, famous name after famous name. (Ten years later, I had the honor of chairing the ETS Conference, which gave me the opportunity to pick the roster of speakers along with ETS staff. I asked Bob to present his ideas on assessment; he gave a talk about National Assessment that featured a short film that he had made. People remarked that they were not certain that he was being "serious." His predictions about NAEP were remarkably prescient.) 

 We picked up Bob's mother in Harrisburg, Pennsylvania, for some reason now forgotten. While we had listened to papers, she had invaded and taken over the U.N. We pointed the station wagon west; we made one stop in Toledo to sleep for a few hours. I did more than my share behind the wheel. I was extremely tired, having not slept well in New York. Bob and I usually slept in the same double bed on this trip and I was too worried about committing some gross act in my sleep to rest comfortably. I had a hard time staying awake during my stints at the wheel, but I would not betray weakness by asking for relief. I nearly fell asleep several times through Ohio, risking snuffing out two promising academic careers and breaking Adams, Nebraska's only diplomatic tie to the United Nations. 

 To help relieve the boredom of the long return trip, Bob and I played a word game that he had learned or invented. It was called "Ghost." Player one thinks of a five-letter word, say "spice." Player two guesses a five-letter word to start; suppose I guessed "steam." Player one superimposes, in his mind, the target word "spice" and my first guess "steam" and sees that one letter coincides—the "s." Since one letter is an odd number of letters, he replies "odd." If no letters coincide he says "even." If I had been very lucky—actually unlucky—and first guessed "slice," player one would reply "even" because four letters coincide. (This would actually have been an unlucky start since one reasonably assumes that the initial response "even" means that zero letters coincide. I think that games of this heinous intricacy are not unknown to Stake children.) Through a process of guessing words and deducing coincidences from "odd" and "even" responses, player two eventually discovers player one's word. It is a difficult game and it can consume hundreds of miles on the road. Several rounds of the game took us through Ohio, Indiana, Illinois. Somewhere around the Quad Cities, Bob played his trump card. He was thinking of a word that resisted all my most assiduous attempts at deciphering. Finally, outside Omaha I conceded defeat. His word was "ouija," as in the board. Do we take this incident as in some way a measure of this man? 

By the time I arrived in Lincoln, a Western Union Telegram from Julian was waiting. I had never before received a telegram—or known anyone who had. I was flattered; I was hooked. Three months later, January 1962, I left Lincoln, Stake and everything I had known my entire life for graduate school. Bob and I corresponded regularly during the ensuing years. He wrote to tell me that he had taken a job at Urbana. I told him I was learning all that was known about statistics. He wrote several times during his summer, 1964, at Stanford in the institute that Lee Cronbach and Richard Atkinson conducted. Clearly it was a transforming experience for him. I was jealous. When I finished my degree in 1965, Bob had engineered a position for me in CIRCE at Univ. of Illinois. I was there when Bob wrote his "Countenance" paper; I pretended to understand it. I learned that there was a world beyond statistics; Bob had undergone enormous changes intellectually since our MDS days. I admired them, even as I recognized my own inability to follow. I spent two years at CIRCE; I think I felt the need to shine my own light away from the long shadows. I picked a place where I thought I might shine: Colorado. 

 Bob and I saw very little of each other from 1967 on. In the early 1970s, I invited him to teach summer school at Boulder. He gave a seminar on evaluation and converted all my graduate students into Stake-ians. But I saw little of him that summer. We didn't connect again until 1978. 

 When the year 1978 arrived, I was at the absolute height of my powers as a quantoid. My book on time-series experiment analysis was being reviewed by generous souls who called it a "watershed." Meta-analysis was raging through the social and behavioral sciences. I had nearly completed the class-size meta-analysis. The Hastings Symposium, on the occasion of Tom Hastings's retirement as head of CIRCE, was happening in Urbana in January. I attended. Lee Cronbach delivered a brilliant paper that gradually metamorphosed into his classic Designing Evaluations of Educational and Social Programs. Lee argued that the place of controlled experiments in educational evaluation is much less than we had once imagined. "External validity," if we must call it that, is far more important than "internal validity," which is after all not just an impossibility but a triviality. Experimental validity can not be reduced to a catechism. Well, this cut to the heart of my quantoid ideology, and I remember rising during the discussion of Lee's paper to remind him that controlled, randomized experiments worked perfectly well in clinical drug trials. He thanked me for divulging this remarkable piece of intelligence. 

 That summer I visited Eva Baker's Center for the Study of Evaluation at UCLA for eight weeks. Bob came for two weeks at Eva's invitation. One day he dropped a sheet of paper on my desk that contained only these words:

 
Chicago  6
New York  5
Lincoln  6
Phoenix  8
Urbana  10
San Francisco  10
We were back to the ghost game, I could tell. I worked all day and half the night on it. I was stuck. Then I remembered that he was staying by himself in a bare apartment just off campus. When I visited it several days before, there had only been a couch, a phone and a phonebook in the living room. I grabbed a phonebook and started perusing it. There near the front was a list of city names and area codes: Chicago 312, New York 212, Lincoln 402; 3+1+2=6, 2+1+2=5, 4+0+2=6, etc. Bingo! He didn't get me this time.
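The trick behind the puzzle is easy to verify. Here is a minimal sketch in Python; note that only the Chicago (312), New York (212) and Lincoln (402) codes are given in the text, so the Phoenix (602), Urbana (217) and San Francisco (415) codes are assumptions based on the area codes in use at the time:

```python
# Digit-sum puzzle: each city's number is the sum of the digits
# of its telephone area code.
area_codes = {
    "Chicago": 312,        # 3 + 1 + 2 = 6
    "New York": 212,       # 2 + 1 + 2 = 5
    "Lincoln": 402,        # 4 + 0 + 2 = 6
    "Phoenix": 602,        # assumed period area code: 6 + 0 + 2 = 8
    "Urbana": 217,         # assumed period area code: 2 + 1 + 7 = 10
    "San Francisco": 415,  # assumed period area code: 4 + 1 + 5 = 10
}

def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))

for city, code in area_codes.items():
    print(f"{city}: {code} -> {digit_sum(code)}")
```

Run as-is, the loop reproduces the six numbers on Bob's sheet.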

I was a quantoid, and "what I do best" was peaking. I gave a colloquium at Eva's center on the class size meta-analysis in mid-June. People were amazed. Jim Popham asked for the paper to inaugurate his new journal Educational Evaluation and Policy Analysis. He was welcome to it.

 June 30, 1978, dawned inauspiciously; I had no warning that it would be my last day on earth as a quantoid. Bob was to speak at a colloquium at the Center on whatever it was that was on his mind at that moment. Ernie House was visiting from Urbana. I was looking forward to the talk, because Bob never gave a dull lecture in his life. That day he talked about portrayal, complexity, understanding; qualities that are not yet nor may never be quantities; the ineffable (Bob has never been a big fan of the "effable"). I listened with respect and admiration, but I listened as one might listen to stories about strange foreign lands, about something that was interesting but that bore no relationship to one's own life. Near the end when questions were being asked I sought to clarify the boundaries that contained Bob's curious thoughts. I asked, "Just to clarify, Bob, between an experimentalist evaluator and a school person with intimate knowledge of the program in question, who would you trust to produce the most reliable knowledge of the program's efficacy?" I sat back confident that I had shown Bob his proper place in evaluation—that he couldn't really claim to assess impact, efficacy, cause-and-effect with his case-study, qualitative methods—and waited for his response, which came with uncharacteristic alacrity. "The school person," he said. I was stunned. Here was a person I respected without qualification whose intelligence I had long admired who was seeing the world far differently from how I saw it. 

Bob and Ernie and I stayed long after the colloquium arguing about Bob's answer; rather, Ernie and I argued vociferously while Bob occasionally interjected a word or sentence of clarification. I insisted that causes could only be known (discovered, found, verified) by randomized, controlled experiments with double-blinding and followed up with statistical significance tests. Ernie and Bob argued that even if you could bring off such an improbable event as the experiment I described, you still wouldn't know what caused a desirable outcome in a particular venue. I couldn't believe what they were saying; I heard it, but I thought they were playing Jesuitical games with words. Was this Bob's ghost game again?

Eventually, after at least an hour's heated discussion, I started to see Bob and Ernie's point. Knowledge of a "cause" in education is not something that automatically results from one of my ideal experiments. Even if my experiment could produce the "cause" of a wonderful educational program, it would remain for those who would share knowledge of that cause with others to describe it to them, or act it out while they watched, or somehow communicate the actions, conditions and circumstances that constitute the "cause" that produces the desired effect. They—Bob and Ernie—saw the experimenter as not trained, not capable of the most important step in the chain: conveying to others a sense of what works and how to bring it about. "Knowing" what caused the success is easier, they believed, than "portraying" to others a sense for what is known.

I cannot tell you, dear reader, why I was at that moment prepared to accept their belief and their arguments, but I was. What they said in that hour after Bob's colloquium suddenly struck me as true. And in the weeks and months after that exchange in Moore Hall at UCLA, I came to believe what they believed about studying education and evaluating schools: many people can know causes; few experiments can clarify causal claims; telling others what we know is the harder part. It was my last day on earth as a quantoid.

 In the early 1970s, Bob introduced me to the writings of another son of Lincoln, Loren Eiseley, the anthropologist, academic and author, whom Wystan H. Auden once named as one of the leading poets of his generation. Eiseley wrote often about his experiences in the classroom; he wrote of "hidden teachers," who touch our lives and never leave us, who speak softly at the back of our minds, who say "Do this; don’t do that." 

In his book The Invisible Pyramid, Eiseley wrote of "The Last Magician." "Every man in his youth—and who is to say when youth is ended?—meets for the last time a magician, a man who made him what he is finally to be." (p. 137) For Eiseley, that last magician is no secret to those who have read his autobiography, All the Strange Hours; he was Frank Speck, an anthropology professor at the University of Pennsylvania who was Eiseley's adviser, then colleague, and to whose endowed chair Eiseley succeeded upon Speck's retirement. (It is a curious coincidence, one that all Freudians will love, that Eiseley's first published book was a biography of Francis Bacon entitled The Man Who Saw Through Time; Francis Bacon and Frank Speck are English and German translations of each other.)

 Eiseley described his encounter with the ghost of his last magician:

"I was fifty years old when my youth ended, and it was, of all unlikely places, within that great unwieldy structure built to last forever and then hastily to be torn down—the Pennsylvania Station in New York. I had come in through a side doorway and was slowly descending a great staircase in a slanting shaft of afternoon sunlight. Distantly I became aware of a man loitering at the bottom of the steps, as though awaiting me there. As I descended he swung about and began climbing toward me.

 "At the instant I saw his upturned face my feet faltered and I almost fell. I was walking to meet a man ten years dead and buried, a man who had been my teacher and confidant. He had not only spread before me as a student the wild background of the forgotten past but had brought alive for me the spruce-forest primitives of today. With him I had absorbed their superstitions, handled their sacred objects, accepted their prophetic dreams. He had been a man of unusual mental powers and formidable personality. In all my experience no dead man but he could have so wrenched time as to walk through its cleft of darkness unharmed into the light of day. 

 "The massive brows and forehead looked up at me as if to demand an accounting of that elapsed time during which I had held his post and discharged his duties. Unwilling step by step I descended rigidly before the baleful eyes. We met, and as my dry mouth strove to utter his name, I was aware that he was passing me as a stranger, that his gaze was directed beyond me, and that he was hastening elsewhere. The blind eye turned sidewise was not, in truth, fixed upon me; I beheld the image but not the reality of a long dead man. Phantom or genetic twin, he passed on, and the crowds of New York closed inscrutably about him." (Pp. 137-8)

 Eiseley had seen a ghost. His mind fixed on the terror he felt at encountering Speck's ghost. They had been friends. Why had he felt afraid?
 "On the slow train running homeward the answer came. I had been away for ten years from the forest. I had had no messages from its depths.... I had been immersed in the postwar administrative life of a growing university. But all the time some accusing spirit, the familiar of the last wood-struck magician, had lingered in my brain. Finally exteriorized, he had stridden up the stair to confront me in the autumn light. Whether he had been imposed in some fashion upon a convenient facsimile or was a genuine illusion was of little importance compared to the message he had brought. I had starved and betrayed myself. It was this that had brought the terror. For the first time in years I left my office in midafternoon and sought the sleeping silence of a nearby cemetery. I was as pale and drained as the Indian pipe plants without chlorophyll that rise after rains on the forest floor. It was time for a change. I wrote a letter and studied timetables. I was returning to the land that bore me." (P. 139)
Whenever I am at my worst—rash, hostile, refusing to listen, unwilling even to try to understand—something tugs at me from somewhere at the back of consciousness, asking me to be better than that, to be more like this person or that person I admire. Bob Stake and I are opposites on most dimensions that I can imagine. I form judgments prematurely; he is slow to judge. I am impetuous; he is reflective. I talk too much; perhaps he talks not enough. I change my persona every decade; his seemingly never changes. And yet, Bob has always been for me a hidden teacher.

Note

This is the text of remarks delivered in part on the occasion of a symposium honoring the retirement of Robert E. Stake, University of Illinois—UC. May 9, 1998 in Urbana, Illinois.

References

Eiseley, Loren (1970). The Invisible Pyramid. New York: Scribner.
Eiseley, Loren (1975). All the Strange Hours. New York: Scribner.

Saturday, July 2, 2022

Meta-analysis at 25: A Personal History

January 2000

Meta-Analysis at 25
A Personal History
Gene V Glass
Arizona State University

Email: glass@asu.edu

It has been nearly 25 years since meta-analysis, under that name and in its current guise, made its first appearance. I wish to avoid the weary references to the new century or millennium—depending on how apocalyptic you're feeling (besides, it's 5759 on my calendar anyway)—and simply point out that meta-analysis is at the age when most things graduate from college, so it's not too soon to ask what accounting can be made of it. I have refrained from publishing anything on the topic of the methods of meta-analysis since about 1980 out of a reluctance to lay some heavy hand on other people's enthusiasms and a wish to hide my cynicism from public view. Others have eagerly advanced its development and I'll get to their contributions shortly (Cooper & Hedges, 1994; Hedges & Olkin, 1985; Hunter, Schmidt, & Jackson, 1982).

Autobiography may be the truest, most honest narrative, even if it risks self-aggrandizement, or worse, self-deception. Forgive me if I risk the latter for the sake of the former. For some reason it is increasingly difficult these days to speak in any other way.

In the span of this rather conventional paper, I wish to review the brief history of the form of quantitative research synthesis that is now generally known as "meta-analysis" (though I can't possibly recount this history as well as has Morton Hunt (1997) in his new book How Science Takes Stock: The Story of Meta-Analysis), tell where it came from, why it happened when it did, what was wrong with it and what remains to be done to make the findings of research in the social and behavioral sciences more understandable and useful.

Meta-analysis Beginnings

In 25 years, meta-analysis has grown from an unheard-of preoccupation of a very small group of statisticians working on problems of research integration in education and psychotherapy to a minor academic industry, as well as a commercial endeavor (see http://epidemiology.com/ and http://members.tripod.com/~Consulting_Unlimited/, for example). A keyword web search (Excite, January 28, 2000)—the contemporary measure of visibility and impact—on the word "meta-analysis" brings 2,200 "hits" of varying degrees of relevance, of course. About 25% of the articles in the Psychological Bulletin in the past several years have the term "meta-analysis" in the title. Its popularity in the social sciences and education is nothing compared to its influence in medicine, where literally hundreds of meta-analyses have been published in the past 20 years. (In fact, my internist quotes findings of what he identifies as published meta-analyses during my physical exams.) An ERIC search shows well over 1,500 articles on meta-analyses written since 1975.

Surely it is true that as far as meta-analysis is concerned, necessity was the mother of invention, and if it hadn't been invented—so to speak—in the early 1970s it would have been invented soon thereafter since the volume of research in many fields was growing at such a rate that traditional narrative approaches to summarizing and integrating research were beginning to break down. But still, the combination of circumstances that brought about meta-analysis in about 1975 may itself be interesting and revealing. There were three circumstances that influenced me.

The first was personal. I left the University of Wisconsin in 1965 with a brand new PhD in psychometrics and statistics and a major league neurosis—years in the making—that was increasingly making my life miserable. Luckily, I found my way into psychotherapy that year while on the faculty of the University of Illinois and never left it until eight years later while teaching at the University of Colorado. I was so impressed with the power of psychotherapy as a means of changing my life and making it better that by 1970 I was studying clinical psychology (with the help of a good friend and colleague Vic Raimy at Boulder) and looking for opportunities to gain experience doing therapy.

In spite of my personal enthusiasm for psychotherapy, the weight of academic opinion at that time derived from Hans Eysenck's frequent and tendentious reviews of the psychotherapy outcome research that proclaimed psychotherapy as worthless—a mere placebo, if that. I found this conclusion personally threatening—it called into question not only the preoccupation of about a decade of my life but my scholarly judgment (and the wisdom of having dropped a fair chunk of change) as well. I read Eysenck's literature reviews and was impressed primarily with their arbitrariness, idiosyncrasy and high-handed dismissiveness. I wanted to take on Eysenck and show that he was wrong: psychotherapy does change lives and make them better.

The second circumstance that prompted meta-analysis to come out when it did had to do with an obligation to give a speech. In 1974, I was elected President of the American Educational Research Association, in a peculiar miscarriage of the democratic process. This position is largely an honorific title that involves little more than chairing a few Association Council meetings and delivering a "presidential address" at the Annual Meeting. It's the "presidential address" that is the problem. No one I know who has served as AERA President really feels that they deserved the honor; the number of more worthy scholars passed over not only exceeds the number of recipients of the honor by several times, but as a group they probably outshine the few who were honored. Consequently, the need to prove one's worthiness to oneself and one's colleagues is nearly overwhelming, and the most public occasion on which to do it is the Presidential address, where one is assured of an audience of 1,500 or so of the world's top educational researchers. Not a few of my predecessors and contemporaries have cracked under this pressure and succumbed to the temptation to spin out grandiose fantasies about how educational research can become infallible or omnipotent, or about how government at national and world levels must be rebuilt to conform to the dictates of educational researchers. And so I approached the middle of the 1970s knowing that by April 1976 I was expected to release some bombast on the world that proved my worthiness for the AERA Presidency, and knowing that most such speeches were embarrassments spun out of feelings of intimidation and unworthiness. 
(A man named Richard Krech, I believe, won my undying respect when I was still in graduate school; having been distinguished by the American Psychological Association in the 1960s with one of its highest research awards, Krech, a professor at Berkeley, informed the Association that he was honored, but that he had nothing particularly new to report to the organization at the obligatory annual convention address, but if in the future he did have anything worth saying, they would hear it first.)

The third set of circumstances that joined my wish to annihilate Eysenck and prove that psychotherapy really works and my need to make a big splash with my Presidential Address was that my training under the likes of Julian Stanley, Chester Harris, Henry Kaiser and George E. P. Box at Wisconsin in statistics and experimental design had left me with a set of doubts and questions about how we were advancing the empirical agenda in educational research. In particular, I had learned to be very skeptical of statistical significance testing; I had learned that all research was imperfect in one respect or another (or, in other words, there are no "perfectly valid" studies nor any line that demarcates "valid" from "invalid" studies); and third, I was beginning to question a taken-for-granted assumption of our work that we progress toward truth by doing what everyone commonly refers to as "studies." (I know that these are complex issues that need to be thoroughly examined to be accurately communicated, and I shall try to return to them.) I recall two publications from graduate school days that impressed me considerably. One was a curve relating serial position of a list of items to be memorized to probability of correct recall that Benton Underwood (1957) had synthesized from a dozen or more published memory experiments. The other was a Psychological Bulletin article by Sandy Astin on the effects of glutamic acid on mental performance (whose results presaged a meta-analysis of the Feingold diet research 30 years later in that poorly controlled experiments showed benefits and well controlled experiments did not).

Permit me to say just a word or two about each of these studies because they very much influenced my thinking about how we should "review" research. Underwood had combined the findings of 16 experiments on serial learning to demonstrate a consistent geometrically decreasing curve describing the declining probability of correct recall as a function of number of previously memorized items, thus giving strong weight to an interference explanation of recall errors. What was interesting about Underwood's curve was that it was an amalgamation of studies that had different lengths of lists and different items to be recalled (nonsense syllables, baseball teams, colors and the like).

Astin's Psychological Bulletin review had attracted my attention in another respect. Glutamic acid—it will now scarcely be remembered—was a discovery of the 1950s that putatively increased the ability of tissue to absorb oxygen. Reasoning with the primitive constructs of the time, researchers hypothesized that more oxygen to the brain would produce more intelligent behavior. (It is not known what amount of oxygen was reaching the brains of the scientists proposing this hypothesis.) A series of experiments in the 1950s and 1960s tested glutamic acid against "control groups," and by 1961, Astin was able to array these findings in a crosstabulation that showed that the chances of finding a significant effect for glutamic acid were related (according to a chi-square test) to the presence or absence of various controls in the experiment; placebos and blinding of assessors, for example, were associated with no significant effect of the acid. As irrelevant as the chi-square test now seems, at the time I saw it done, it was revelatory to see "studies" being treated as data points in a statistical analysis. (In 1967, I attempted a similar approach while reviewing the experimental evidence on the Doman-Delacato pattern therapy; Glass & Robbins, 1967.)

At about the same time I was reading Underwood and Astin, I certainly must have read Ben Bloom's Stability and Change in Human Characteristics (1964), but its aggregated graphs of correlation coefficients made no impression on me, because it was many years after the work to be described below that I noticed a similarity between his approach and meta-analysis. Perhaps the connections were not made because Bloom dealt with variables such as age, weight, height, IQ and the like, where the problems of dissimilarity of variables did not force one to worry about the kinds of problems that lie at the heart of meta-analysis.

If precedence is of any concern, Bob Rosenthal deserves as much credit as anyone for furthering what we now conveniently call "meta-analysis." In 1976, he published Experimenter Effects in Behavioral Research, which contained calculations of many "effect sizes" (i.e., standardized mean differences) that were then compared across domains or conditions. If Bob had just gone a little further in quantifying study characteristics and subjecting the whole business to regression analyses and what-not, and then thought up a snappy name, it would be his name that came up whenever the subject of research integration arose. But Bob had an even more positive influence on the development of meta-analysis than one would infer from his numerous methodological writings on the subject. When I was making my initial forays onto the battlefield of psychotherapy outcome research—about which more soon—Bob wrote me a very nice and encouraging letter in which he indicated that the approach we were taking made perfect sense. Of course, it ought to have made sense to him, considering that it was not that different from what he had done in Experimenter Effects. He probably doesn't realize how important that validation from a stranger was. (And while on the topic of snappy names, although people have suggested or promoted several polysyllabic alternatives—quantitative synthesis, statistical research integration—the name meta-analysis, suggested by Michael Scriven's meta-evaluation (meaning the evaluation of evaluations), appears to have caught on. To press on further into it, the "meta" comes from the Greek preposition meaning "behind" or "in back of." Its application as in "metaphysics" derives from the fact that in the publication of Aristotle's writings during the Middle Ages, the section dealing with the transcendental was bound immediately behind the section dealing with physics; lacking any title provided by its author, this final section became known as Aristotle's "metaphysics."
So, in fact, metaphysics is not some grander form of physics, some all-encompassing, overarching general theory of everything; it is merely what Aristotle put after the stuff he wrote on physics. The point of this aside is to attempt to leach out of the term "meta-analysis" some of the grandiosity that others see in it. It is not the grand theory of research; it is simply a way of speaking of the statistical analysis of statistical analyses.)

So positioned in these circumstances, in the summer of 1974, I set about to do battle with Dr. Eysenck and prove that psychotherapy—my psychotherapy—was an effective treatment. (Incidentally, though it may be of only the merest passing interest, my preferences for psychotherapy are Freudian, a predilection that causes Ron Nelson and others of my ASU colleagues great distress, I'm sure.) I joined the battle with Eysenck's 1965 review of the psychotherapy outcome literature. Eysenck began his famous reviews by eliminating from consideration all theses, dissertations, project reports or other contemptible items not published in peer-reviewed journals. This arbitrary exclusion of literally hundreds of evaluations of therapy outcomes was indefensible. It's one thing to believe that peer review guarantees truth; it is quite another to believe that all truth appears in peer reviewed journals. (The most important paper on the multiple comparisons problem in ANOVA was distributed as an unpublished ditto manuscript from the Princeton University Mathematics Department by John Tukey; it never was published in a peer reviewed journal.)

Next, Eysenck eliminated any experiment that did not include an untreated control group. This makes no sense whatever, since head-to-head comparisons of two different types of psychotherapy contribute a great deal to our knowledge of psychotherapy effects. If a horse runs 20 mph faster than a man and 35 mph faster than a pig, I can conclude with confidence that the man will outrun the pig by 15 mph. Having winnowed a huge literature down to 11 studies (!) by whim and prejudice, Eysenck proceeded to describe their findings solely in terms of whether or not statistical significance was attained at the .05 level. No matter that the results may have barely missed the .05 level or soared beyond it. All that Eysenck considered worth noting about an experiment was whether the differences reached significance at the .05 level. If it reached significance at only the .07 level, Eysenck classified it as showing "no effect for psychotherapy."

Finally, Eysenck did something truly staggering in its illogic. If a study showed significant differences favoring therapy over control on what he regarded as a "subjective" measure of outcome (e.g., the Rorschach or the Thematic Apperception Test), he discounted the findings entirely. So be it; he may be a tough case, but that's his right. But then, when encountering a study that showed differences on an "objective" outcome measure (e.g., GPA) but no differences on a subjective measure (like the TAT), Eysenck discounted the entire study because the outcome differences were "inconsistent."

Looking back on it, I can almost credit Eysenck with the invention of meta-analysis by antithesis. By doing everything in the opposite way that he did, one would have been led straight to meta-analysis. Adopt an a posteriori attitude toward including studies in a synthesis, replace statistical significance by measures of strength of relationship or effect, and view the entire task of integration as a problem in data analysis where "studies" are quantified and the resulting data-base subjected to statistical analysis, and meta-analysis assumes its first formulation. (Thank you, Professor Eysenck.)
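The formulation just described can be reduced to a few lines of code: quantify each study as an effect size, then treat the collection of effect sizes as an ordinary data set. The numbers below are invented for illustration, and the formula shown (a mean difference standardized by the control-group standard deviation) is one common definition of effect size; nothing here reproduces the actual psychotherapy data.

```python
# A minimal sketch of meta-analysis in its first formulation:
# "studies" become data points, each reduced to an effect size.

def effect_size(mean_t: float, mean_c: float, sd_c: float) -> float:
    """Standardized mean difference: (treatment - control) / control SD."""
    return (mean_t - mean_c) / sd_c

# Hypothetical studies: (treatment mean, control mean, control SD),
# possibly measured on entirely different outcome scales.
studies = [
    (24.0, 18.0, 9.0),
    (51.0, 47.5, 7.0),
    (103.0, 96.0, 10.0),
]

deltas = [effect_size(*s) for s in studies]
mean_delta = sum(deltas) / len(deltas)
print([round(d, 2) for d in deltas], round(mean_delta, 2))
# -> [0.67, 0.5, 0.7] 0.62
```

Because each study is standardized against its own control group, studies using different outcome measures land on a common metric and can then be averaged, regressed on coded study characteristics, and so on.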

Working with my colleague Mary Lee Smith, I set about to collect all the psychotherapy outcome studies that could be found and subjected them to this new form of analysis. By May of 1975, the results were ready to try out on a friendly group of colleagues. The May 12th Group had been meeting yearly since about 1968 to talk about problems in the area of program evaluation. The 1975 meeting was held in Tampa at Dick Jaeger's place. I worked up a brief handout and nervously gave my friends an account of the preliminary results of the psychotherapy meta-analysis. Lee Cronbach was there; so was Bob Stake, David Wiley, Les McLean and other trusted colleagues who could be relied on to demolish any foolishness they might see. To my immense relief they found the approach plausible or at least not obviously stupid. (I drew frequently in the future on that reassurance when others, whom I respected less, pronounced the entire business stupid.)

The first meta-analysis of the psychotherapy outcome research found that the typical therapy trial raised the treatment group to a level about two-thirds of a standard deviation on average above untreated controls; the average person receiving therapy finished the experiment in a position that exceeded the 75th percentile in the control group on whatever outcome measure happened to be taken. This finding summarized dozens of experiments encompassing a few thousand persons as subjects and must have been cold comfort to Professor Eysenck.
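The translation from "two-thirds of a standard deviation" to "the 75th percentile" assumes roughly normal outcome distributions; it is just the standard normal CDF evaluated at 2/3. A quick check:

```python
# If the average treated person scores 2/3 SD above the control mean,
# and control-group scores are roughly normal, that person's standing
# within the control distribution is the normal CDF at 2/3.
from statistics import NormalDist

delta = 2 / 3
percentile = NormalDist().cdf(delta) * 100
print(round(percentile, 1))  # -> 74.8, i.e., about the 75th percentile
```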

An expansion and reworking of the psychotherapy experiments resulted in the paper that was delivered as the much feared AERA Presidential address in April 1976. Its reception was gratifying. Two months later a long version was presented at a meeting of psychotherapy researchers in San Diego. Their reactions foreshadowed the eventual reception of the work among psychologists. Some said that the work was revolutionary and proved what they had known all along; others said it was wrongheaded and meaningless. The widest publication of the work came in 1977, in a now, may I say, famous article by Smith and Glass in the American Psychologist. Eysenck responded to the article by calling it "mega-silliness," a moderately clever play on meta-analysis that nonetheless swayed few.

Psychologists tended to fixate on the fact that the analysis gave no warrant to any claims that one type or style of psychotherapy was any more effective than any other: whether called "behavioral" or "Rogerian" or "rational" or "psychodynamic," all the therapies seemed to work and to work to about the same degree of effectiveness. Behavior therapists, who had claimed victory in the psychotherapy horserace because they were "scientific" and others weren't, found this conclusion unacceptable and took it as reason enough to declare meta-analysis invalid. Non-behavioral therapists—the Rogerians, Adlerians and Freudians, to name a few—hailed the meta-analysis as one of the great achievements of psychological research: a "classic," a "watershed." My cynicism about research and much of psychology dates from approximately this period.

Criticisms of Meta-analysis

The first appearances of meta-analysis in the 1970s were not met universally with encomiums and expressions of gratitude. There was no shortage of critics who found the whole idea wrong-headed, senseless, misbegotten, etc.

The Apples-and-Oranges Problem

Of course the most often repeated criticism of meta-analysis was that it was meaningless because it "mixed apples and oranges." I was not unprepared for this criticism; indeed, I had long before prepared my own defense: "Of course it mixes apples and oranges; in the study of fruit nothing else is sensible; comparing apples and oranges is the only endeavor worthy of true scientists; comparing apples to apples is trivial." But I misjudged the degree to which this criticism would take hold of people's opinions and shut down their minds. At times I even began to entertain my own doubts that it made sense to integrate any two studies unless they were studies of "the same thing." But, the same persons who were arguing that no two studies should be compared unless they were studies of the "same thing," were blithely comparing persons (i.e., experimental "subjects") within their studies all the time. This seemed inconsistent. Plus, I had a glimmer of the self-contradictory nature of the statement "No two things can be compared unless they are the same." If they are the same, there is no reason to compare them; indeed, if "they" are the same, then there are not two things, there is only one thing and comparison is not an issue. And yet I had a gnawing insecurity that the critics might be right. One study is an apple, and a second study is an orange; and comparing them is as stupid as comparing apples and oranges, except that sometimes I do hesitate while considering whether I'm hungry for an apple or an orange.

At about this time—late 1970s—I was browsing through a new book that I had bought out of a vague sense that it might be worth my time because it was written by a Harvard philosopher, carried a title like Philosophical Explanations and was written by an author—Robert Nozick—who had written one of the few pieces on the philosophy of the social sciences that ever impressed me as being worth rereading. To my amazement, Nozick spent the first one hundred pages of his book on the problem of "identity," i.e., what does it mean to say that two things are the same? Starting with the puzzle of how two things that are alike in every respect would not be one thing, Nozick unraveled the problem of identity and discovered its fundamental nature underlying a host of philosophical questions ranging from "How do we think?" to "How do I know that I am I?" Here, I thought at last, might be the answer to the "apples and oranges" question. And indeed, it was there.

Nozick considered the classic problem of Theseus's ship. Theseus, King of Athens, and his men are plying the waters of the Mediterranean. Each day a sailor replaces a wooden plank in the ship. After nearly five years, every plank has been replaced. Are Theseus and his men still sailing in the same ship that was launched five years earlier on the Mediterranean? "Of course," most will answer. But suppose that as each original plank was removed, it was taken ashore and positioned exactly as it had been in the ship at sea, so that at the end of five years there exists a ship on shore, every plank of which once stood in exactly the same relationship to every other in what five years earlier had been Theseus's ship. Is this ship on shore—which we could easily launch if we so chose—Theseus's ship? Or is the ship sailing the Mediterranean with all of its new planks the same ship that we originally regarded as Theseus's ship? The answer depends on what we understand the concept of "same" to mean.

Consider an even more troubling example that stems from the problem of the persistence of personal identity. How do I know that I am that person who I was yesterday, or last year, or twenty-five years ago? Why would an old high-school friend say that I am Gene Glass, even though hundreds, no, thousands of things about me have changed since high school? Probably no cells are in common between this organism and the organism that responded to the name "Gene Glass" forty years ago; I can assure you that there are few attitudes and thoughts held in common between these two organisms—or is it one organism? Why, then, would an old high-school friend, suitably prompted, say without hesitation, "Yes, this is Gene Glass, the same person I went to high school with"? Nozick argued that the only sense in which personal identity survives across time is in the sense of what he called "the closest related continuer." I am still recognized as Gene Glass by those who knew me then because I am that thing most closely related to the person to whom they applied the name "Gene Glass" over forty years ago. Now notice that implied in this concept of the "closest related continuer" are notions of distance and relationship. Nozick was quite clear that these concepts had to be given concrete definition to understand how in particular instances people use the concept of identity. In fact, to Nozick's way of thinking, things are compared by means of weighted functions of constituent factors, and their "distance" from each other is "calculated" in many instances in a Euclidean way.

Consider Theseus's ship again. Is the ship sailing the seas the "same" ship that Theseus launched five years earlier? Or is the ship on the shore made of all the original planks from that first ship the "same" as Theseus's original ship? If I give great weight to the materials and the length of time those materials functioned as a ship (i.e., to displace water and float things), then the vessel on the shore is the closest related continuation of what historically had been called "Theseus's ship." But if, instead, I give great weight to different factors such as the importance of the battles the vessel was involved in (and Theseus's big battles were all within the last three years), then the vessel that now floats on the Mediterranean—not the ship on the shore made up of Theseus's original planks—is Theseus's ship, and the thing on the shore is old spare parts.
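Nozick's weighted-distance idea can be made concrete with a toy calculation. The factor names, scores, and weights below are my own inventions for illustration, not Nozick's; the point is only that which candidate counts as the "closest related continuer" flips when the weights change.

```python
import math

# Factor profiles (0-1 scale) for the original ship and the two candidates.
# Factors: material continuity, structural continuity, battle history.
original = {"materials": 1.0, "structure": 1.0, "history": 1.0}
sailing  = {"materials": 0.0, "structure": 1.0, "history": 1.0}  # all new planks, fought the battles
ashore   = {"materials": 1.0, "structure": 1.0, "history": 0.2}  # original planks, idle on shore

def distance(a, b, weights):
    """Weighted Euclidean distance between two factor profiles."""
    return math.sqrt(sum(w * (a[f] - b[f]) ** 2 for f, w in weights.items()))

def closest_continuer(candidates, weights):
    """The candidate at the smallest weighted distance from the original."""
    return min(candidates, key=lambda name: distance(original, candidates[name], weights))

candidates = {"sailing": sailing, "ashore": ashore}

# Weight the materials heavily: the reassembled ship on shore wins.
print(closest_continuer(candidates, {"materials": 5, "structure": 1, "history": 1}))  # ashore

# Weight the battle history heavily: the ship still at sea wins.
print(closest_continuer(candidates, {"materials": 1, "structure": 1, "history": 5}))  # sailing
```

The same data yield opposite verdicts under different weightings, which is exactly the empirical turn the next paragraph describes.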

So here was Nozick saying that the fundamental riddle of how two things could be the same ultimately resolves itself into an empirical question involving observable factors and weighing them in various combinations to determine the closest related continuer. The question of "sameness" is not an a priori question at all; apart from being a logical impossibility, it is an empirical question. For us, no two "studies" are the same. All studies differ and the only interesting questions to ask about them concern how they vary across the factors we conceive of as important. This notion is not fully developed here and I will return to it later.

The "Flat Earth" Criticism

I may not be the best person to critique meta-analysis, for obvious reasons. However, I will cop to legitimate criticisms of the approach when I see them, and I haven't seen many. But one criticism rings true because I knew at the time that I was being forced into a position with which I wasn't comfortable. Permit me to return to the psychotherapy meta-analysis.

Eysenck was, as I have said, a nettlesome critic of the psychotherapy establishment in the 1960s and 1970s. His exaggerated and inflammatory statements about psychotherapy being worthless (no better than a placebo) were not believed by psychotherapists or researchers, but they were not being effectively rebutted either. Instead of taking him head-on, as my colleagues and I attempted to do, researchers, like Gordon Paul, for example, attempted to argue that the question whether psychotherapy was effective was fundamentally meaningless. Rather, asserted Paul, while many others assented, the only legitimate research question was "What type of therapy, with what type of client, produces what kind of effect?" I confess that I found this distracting dodge as frustrating as I found Eysenck's blanket condemnation. Here was a critic—Eysenck—saying that all psychotherapists are either frauds or gullible, self-deluded incompetents, and the establishment's response is to assert that he is not making a meaningful claim. Well, he was making a meaningful claim; and I already knew enough from the meta-analysis of the outcome studies to know that Paul's question was unanswerable due to insufficient data, and that researchers were showing almost no interest in collecting the kind of data that Paul and others argued were the only meaningful data.

It fell to me, I thought, to argue that the general question "Is psychotherapy effective?" is meaningful and that psychotherapy is effective. Such generalizations—across types of therapy, types of client and types of outcome—are meaningful to many people—policy makers, average citizens—if not to psychotherapy researchers or psychotherapists themselves. It was not that I necessarily believed that different therapies did not have different effects for different kinds of people; rather, I felt certain that the available evidence, tons of it, did not establish with any degree of confidence what these differential effects were. It was safe to say that in general psychotherapy works on many things for most people, but it was impossible to argue that this therapy was better than that therapy for this kind of problem. (I might add that twenty years after the publication of The Benefits of Psychotherapy, I still have not seen compelling answers to Paul's questions, nor is there evidence of researchers having any interest in answering them.)

The circumstances of the debate, then, put me in the position of arguing, circa 1980, that there are very few differences among various ways of treating human beings and that, at least, there is scarcely any convincing experimental evidence to back up claims of differential effects. And that policy makers and others hardly need to waste their time asking such questions or looking for the answers. Psychotherapy works; all types of therapy work about equally well; support any of them with your tax dollars or your insurance policies. Class size reductions work—very gradually at first (from 30 to 25 say) but more impressively later (from 15 to 10); they work equally for all grades, all subjects, all types of student. Reduce class sizes, and it doesn't matter where or for whom.

Well, one of my most respected colleagues took me to task for this way of thinking and using social science research. In a beautiful and important paper entitled "Prudent Aspirations for Social Inquiry," Lee Cronbach chastised his profession for promising too much and chastised me for expecting too little. He lumped me with a small group of like-minded souls into what he named the "Flat Earth Society," i.e., a group of people who believe that the terrain that social scientists explore is featureless and flat, with no interesting interactions or topography. All therapies work equally well; all tests predict success to about the same degree; etc.:

"...some of our colleagues are beginning to sound like a kind of Flat Earth Society. They tell us that the world is essentially simple: most social phenomena are adequately described by linear relations; one-parameter scaling can discover coherent variables independent of culture and population; and inconsistencies among studies of the same kind will vanish if we but amalgamate a sufficient number of studies.... The Flat Earth folk seek to bury any complex hypothesis with an empirical bulldozer." (Cronbach, 1982, p. 70)

Cronbach's criticism stung because it was on target. In attempting to refute Eysenck's outlandishness without endorsing the psychotherapy establishment's obfuscation, I had taken a position of condescending simplicity. A meta-analysis will give you the BIG FACT, I said; don't ask for more sophisticated answers; they aren't there. My own work tended to take this form, and much of what has ensued in the past 25 years has regrettably followed suit. Effect sizes—if it is experiments that are at issue—are calculated, classified in a few ways, perhaps, and all their variability is then averaged across. Little effort is invested in trying to plot the complex, variegated landscape that most likely underlies our crude averages.

Consider an example that may help illuminate these matters. Perhaps the most controversial conclusion from the psychotherapy meta-analysis that my colleagues and I published in 1980 was that there was no evidence favoring behavioral psychotherapies over non-behavioral psychotherapies. This finding was vilified by the behavioral therapy camp and praised by the Rogerians and Freudians. Some years later, prodded by Cronbach's criticism, I returned to the database and dug a little deeper. What I found appears in Figure 2. When the effects of treatment in the nine experiments extant in 1979—and I would be surprised if there are many more now—in which behavioral and non-behavioral psychotherapies were compared between randomized groups in the same experiment are plotted as a function of follow-up time, the two curves in Figure 2 result. The findings are quite extraordinary and suggestive. Behavioral therapies produce large short-term effects which decay in strength over the first year of follow-up; non-behavioral therapies produce initially smaller effects which increase over time. The two curves appear to be converging on the same long-term effect. I leave it to the reader to imagine why. One answer, I suspect, is not arcane and is quite plausible.

Figure 2, I believe, is truer to Cronbach's conception of reality and how research, even meta-analysis, can lead us to a more sophisticated understanding of our world. Indeed, the world is not flat; it encompasses all manner of interesting hills and valleys, and in general, averages do not do it justice.

Extensions of Meta-analysis

In the twenty-five years between the first appearance of the word "meta-analysis" in print and today, there have been several attempts to modify the approach, or advance alternatives to it, or extend the method to reach auxiliary issues. If I may be so cruel, few of these efforts have added much. One of the hardest things to abide in following the developments in meta-analysis methods in the past couple of decades was the frequent observation that what I had contributed to the problem of research synthesis was the idea of dividing mean differences by standard deviations. "Effect sizes," as they are called, had been around for decades before I opened my first statistics text. Having to read that "Glass has proposed integrating studies by dividing mean differences by standard deviations and averaging them" was a bitter pill to swallow. Some of the earliest work that my colleagues and I did involved a variety of outcome measures to be analyzed and synthesized: correlations, regression coefficients, proportions, odds ratios. Well, so be it; better to be mentioned in any favorable light than not to be remembered at all.
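For readers who have not seen it, the effect-size calculation at issue is simple to state. The sketch below, with invented scores, computes what is often called Glass's delta: the mean difference between treatment and control, divided by the control group's standard deviation.

```python
import statistics

def glass_delta(treatment, control):
    """Standardized mean difference, scaled by the control group's SD (Glass's delta)."""
    return (statistics.mean(treatment) - statistics.mean(control)) / statistics.stdev(control)

# Invented outcome scores for one hypothetical two-group experiment.
treatment_scores = [52, 55, 60, 58, 61]
control_scores = [48, 50, 49, 53, 50]

print(round(glass_delta(treatment_scores, control_scores), 2))
```

Expressing every study's outcome on this common scale is what lets findings from different studies be placed side by side at all.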

After all, this was not as hard to take as newly minted confections such as "best evidence research synthesis," a come-lately contribution that added nothing whatsoever to what I and many others had been saying repeatedly on the question of whether meta-analyses should use all studies or only "good" studies. I remain staunchly committed to the idea that meta-analyses must deal with all studies, good, bad, and indifferent, and that their results are only properly understood in the context of each other, not after having been censored by some a priori set of prejudices. An effect size of 1.50 for 20 studies employing randomized groups has a whole different meaning when 50 studies using matching show an average effect of 1.40 than if 50 matched-groups studies show an effect of -.50, for example.

Statistical Inference in Meta-analysis

The appropriate role for inferential statistics in meta-analysis is not merely unclear; it has been seen quite differently by different methodologists in the 25 years since meta-analysis appeared. In 1981, in the first extended discussion of the topic (Glass, McGaw and Smith, 1981), I raised doubts about the applicability of inferential statistics in meta-analysis. Inference at the level of persons within studies seemed quite unnecessary, since even a modest-size synthesis will involve a few hundred persons (nested within studies) and lead to nearly automatic rejection of null hypotheses. Moreover, the chances are remote that the persons or subjects within studies were drawn from defined populations with anything even remotely resembling probabilistic techniques. Hence, probabilistic calculations advanced as if subjects had been randomly selected would be dubious. At the level of "studies," the question of the appropriateness of inferential statistics can be posed again, and the answer again seems to be negative. There are two instances in which common inferential methods are clearly appropriate, not just in meta-analysis but in any research: 1) when a well-defined population has been randomly sampled, and 2) when subjects have been randomly assigned to conditions in a controlled experiment. In the latter case, Fisher showed how the permutation test can be used to make inferences to the universe of all possible permutations. But this case is of little interest to meta-analysts, who never assign units to treatments. Moreover, the typical meta-analysis virtually never meets the condition of probabilistic sampling of a population (though in one instance (Smith, Glass & Miller, 1980), the available population of psychoactive drug treatment experiments was so large that a random sample of experiments was in fact drawn for the meta-analysis). Inferential statistics has little role to play in meta-analysis.
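Fisher's permutation logic, mentioned above, is easy to sketch. The scores below are invented; the test refers the observed mean difference to the distribution of differences over all possible re-assignments of the same scores to the two groups, which is exactly the universe that random assignment licenses.

```python
import itertools
import statistics

# Invented outcome scores from one small randomized two-group experiment.
treatment = [9, 11, 12, 14]
control = [6, 7, 8, 10]

pooled = treatment + control
observed = statistics.mean(treatment) - statistics.mean(control)

# Enumerate every way of labeling 4 of the 8 scores "treatment".
count_extreme = 0
total = 0
for combo in itertools.combinations(range(len(pooled)), len(treatment)):
    t = [pooled[i] for i in combo]
    c = [pooled[i] for i in range(len(pooled)) if i not in combo]
    total += 1
    if statistics.mean(t) - statistics.mean(c) >= observed:
        count_extreme += 1

p_value = count_extreme / total  # one-sided permutation p-value
print(p_value)  # about .029 for these toy data
```

Note that the inference runs only over re-randomizations of the units in hand; nothing here licenses a leap to some unsampled population, which is the point of the paragraph above.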

It is common to acknowledge, in meta-analysis and elsewhere, that many data sets fail to meet probabilistic sampling conditions, and then to argue that one ought to treat the data in hand "as if" they were a random sample of some hypothetical population. One must be wary here of the slide from "hypothesis about a population" into "a hypothetical population." They are quite different things: the former is standard and unobjectionable; the latter is a figment with which we hardly know how to deal. Under the stipulation that one is making inferences not to some defined or known population but to a hypothetical one, inferential techniques are applied and the results inspected. The direction taken mirrors some of the earliest published opinion on this problem in the context of research synthesis, expressed, for example, by Mosteller and his colleagues in 1977:

"One might expect that if our MEDLARS approach were perfect and produced all the papers we would have a census rather than a sample of the papers. To adopt this model would be to misunderstand our purpose. We think of a process producing these research studies through time, and we think of our sample—even if it were a census—as a sample in time from the process. Thus, our inference would still be to the general process, even if we did have all appropriate papers from a time period." (Gilbert, McPeek and Mosteller, 1977, p. 127; quoted in Cook et al., 1992, p. 291)

This position is repeated in slightly different language by Larry Hedges in Chapter 3, "Statistical Considerations," of the Handbook of Research Synthesis (1994): "The universe is the hypothetical collection of studies that could be conducted in principle and about which we wish to generalize. The study sample is the ensemble of studies that are used in the review and that provide the effect size data used in the research synthesis." (p. 30)

These notions appear to be circular. If the sample is fixed and the population is allowed to be hypothetical, then surely the data analyst will imagine a population that resembles the sample of data. If I show you a handful of red and green M&Ms, you will naturally assume that I have just drawn my hand out of a bowl of mostly red and green M&Ms, not red and green and brown and yellow ones. Hence, all of these "hypothetical populations" will be merely reflections of the samples in hand and there will be no need for inferential statistics. Or put another way, if the population of inference is not defined by considerations separate from the characterization of the sample, then the population is merely a large version of the sample. With what confidence is one able to generalize the character of this sample to a population that looks like a big version of the sample? Well, with a great deal of confidence, obviously. But then, the population is nothing but the sample writ large and we really know nothing more than what the sample tells us in spite of the fact that we have attached misleadingly precise probability numbers to the result.

Hedges and Olkin (1985) have developed inferential techniques that ignore the pro forma testing (because of large N) of null hypotheses and focus on estimating regression functions that model effects at different levels of study characteristics. They worry about both sources of statistical instability: that arising from persons within studies and that which arises from variation between studies. The techniques they present are based on traditional assumptions of random sampling and independence. It is, of course, unclear to me precisely how the validity of their methods is compromised by failure to achieve probabilistic sampling of persons and studies.

The irony of traditional hypothesis testing approaches applied to meta-analysis is that whereas consideration of sampling error at the level of persons always leads to a pro forma rejection of "null hypotheses" (of zero correlation or zero average effect size), consideration of sampling error at the level of study characteristics (the study, not the person, as the unit of analysis) leads to too few rejections (too many Type II errors, one might say). Hedges's homogeneity test of the hypothesis that all studies in a group estimate the same population parameter is frequently seen in published meta-analyses these days. Once a hypothesis of homogeneity is accepted by Hedges's test, one is advised to treat all studies within the ensemble as the same. Experienced data analysts know, however, that there is typically a good deal of meaningful covariation between study characteristics and study findings even within ensembles where Hedges's test cannot reject the homogeneity hypothesis. The situation is parallel to the experience of psychometricians discovering that they could easily interpret several more common factors than inferential solutions (maximum-likelihood; LISREL) could confirm. The best data exploration and discovery are more complex and convincing than the most exact inferential test. In short, classical statistics seems unable to reproduce the complex cognitive processes that are commonly applied with success by data analysts.
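For concreteness, here is a sketch of the homogeneity statistic in its usual textbook form, Q = sum of w_i (d_i - dbar)^2 with inverse-variance weights, referred to a chi-square distribution on k - 1 degrees of freedom. The effect sizes and variances below are invented; they are chosen to show how visibly varied effects can still fail to trigger the test.

```python
# Invented per-study effect sizes d_i and their sampling variances.
effects = [0.30, 0.45, 0.10, 0.60]
variances = [0.02, 0.03, 0.025, 0.04]

# Inverse-variance weights and the weighted mean effect.
weights = [1 / v for v in variances]
d_bar = sum(w * d for w, d in zip(weights, effects)) / sum(weights)

# Hedges's Q statistic.
Q = sum(w * (d - d_bar) ** 2 for w, d in zip(weights, effects))
k = len(effects)

print(round(d_bar, 3), round(Q, 2), "df =", k - 1)
# Q is about 4.46 on 3 df, well below the .05 chi-square critical value of
# 7.81, so the test cannot reject homogeneity even though the effects range
# from 0.10 to 0.60 -- the very situation the paragraph above warns about.
```

A small Q, in other words, is weak evidence that the studies are interchangeable, not proof of it.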

Donald Rubin (1990) addressed some of these issues squarely and articulated a position that I find very appealing: "...consider the idea that sampling and representativeness of the studies in a meta-analysis are important. I will claim that this is nonsense—we don't have to worry about representing a population but rather about other far more important things." (p. 155) These more important things to Rubin are the estimation of treatment effects under a set of standard or ideal study conditions. This process, as he outlined it, involves the fitting of response surfaces (a form of quantitative model building) between study effects (Y) and study conditions (X, W, Z etc.). I would only add to Rubin's statement that we are interested in not merely the response of the system under ideal study conditions but under many conditions having nothing to do with an ideally designed study, e.g., person characteristics, follow-up times and the like.
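A response surface of the kind Rubin describes can be sketched as an ordinary regression of effect sizes on study conditions. The data below are invented to mimic the general shape of the curves in Figure 2, with follow-up time as the single predictor; a real response surface would carry many more study factors.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# (follow-up months, effect size) pairs, invented for illustration.
behavioral = [(0, 0.95), (3, 0.80), (6, 0.70), (12, 0.55)]
nonbehavioral = [(0, 0.40), (3, 0.50), (6, 0.58), (12, 0.68)]

results = {}
for name, data in [("behavioral", behavioral), ("non-behavioral", nonbehavioral)]:
    xs, ys = zip(*data)
    slope, intercept = linear_fit(xs, ys)
    results[name] = (round(slope, 4), round(intercept, 4))
    print(name, results[name])
# Opposite-signed slopes: one curve decays over follow-up time while the
# other rises toward it -- structure that a single grand average would erase.
```

Fitting effects as a function of study conditions, rather than averaging across them, is the whole difference between a response surface and a BIG FACT.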

By far most meta-analyses are undertaken in pursuit not of scientific theory but technological evaluation. The evaluation question is never whether some hypothesis or model is accepted or rejected but rather how "outputs" or "benefits" or "effect sizes" vary from one set of circumstances to another; and the meta-analysis rarely works on a collection of data that can sensibly be described as a probability sample from anything.

Meta-analysis in the Next 25 Years

If our efforts to research and improve education are to prosper, meta-analysis will have to be replaced by more useful and more accurate ways of synthesizing research findings. To catch a glimpse of what this future for research integration might look like, we need to look back at the deficiencies in our research customs that produced meta-analysis in the first place.

First, the high cost in the past of publishing research results led to cryptic reporting styles that discarded most of the useful information that research revealed. To encapsulate complex relationships in statements like "significant at the .05 level" was a travesty—a travesty that continues today out of bad habit and bureaucratic inertia.

Second, we need to stop thinking of ourselves as scientists testing grand theories, and face the fact that we are technicians collecting and collating information, often in quantitative forms. Paul Meehl (1967; 1978) dispelled once and for all the misconception that we in what he called the "soft social sciences" are testing theories in any way even remotely resembling how theory focuses and advances research in the hard sciences. Indeed, the mistaken notion that we are theory driven has, in Meehl's opinion, led us into a worthless pro forma ritual of testing and rejecting statistical hypotheses that are a priori known to be 99% false before they are tested.

Third, the conception of our work that held that "studies" are the basic, fundamental unit of a research program may be the single most counterproductive influence of all. This idea that we design a "study," and that a study culminates in the test of a hypothesis, and that a hypothesis comes from a theory—this idea has done more to retard progress in educational research than any other single notion. Ask educational researchers what they are up to, and they will reply that they are "doing a study," or "designing a study," or "writing up a study" for publication. Ask a physicist what's up and you'll never hear the word "study." (In fact, if one goes to http://xxx.lanl.gov where physicists archive their work, one will seldom see the word "study." Rather, physicists—the data-gathering experimental ones—report data, all of it, that they have collected under conditions that they carefully describe. They contrive interesting conditions that can be precisely described and then they report the resulting observations.)

Meta-analysis was created out of the need to extract useful information from the cryptic records of inferential data analyses in the abbreviated reports of research in journals and other printed sources. "What does this t-test really say about the efficacy of ritalin in comparison to caffeine?" Meta-analysis needs to be replaced by archives of raw data that permit the construction of complex data landscapes that depict the relationships among independent, dependent and mediating variables. We wish to be able to answer the question, "What is the response of males ages 5-8 to ritalin at these dosage levels on attention, acting out and academic achievement after one, three, six and twelve months of treatment?"
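The recovery being complained about can be sketched directly. For two independent groups, a reported t statistic can be converted back into a standardized mean difference via d = t * sqrt(1/n1 + 1/n2); the t value and sample sizes below are invented, not drawn from any actual ritalin/caffeine study.

```python
import math

def d_from_t(t, n1, n2):
    """Standardized mean difference recovered from an independent-groups t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)

# A journal reports only "t(38) = 2.10, p < .05" for two groups of 20.
print(round(d_from_t(2.10, 20, 20), 2))  # 0.66
```

That an effect size must be reverse-engineered from a t statistic at all is the cryptic-reporting problem that raw-data archives would eliminate.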

We can move toward this vision of useful synthesized archives of research now if we simply re-orient our ideas about what we are doing when we do research. We are not testing grand theories, rather we are charting dosage-response curves for technological interventions under a variety of circumstances. We are not informing colleagues that our straw-person null hypothesis has been rejected at the .01 level, rather we are sharing data collected and reported according to some commonly accepted protocols. We aren't publishing "studies," rather we are contributing to data archives.

Five years ago, this vision of how research should be reported and shared seemed hopelessly quixotic. Now it seems easily attainable. The difference is the I-word: the Internet. In 1993, spurred by the ludicrously high costs and glacial turn-around times of traditional scholarly journals, I created an internet-based peer-reviewed journal on education policy analysis (http://epaa.asu.edu). This journal, named Education Policy Analysis Archives, is now in its seventh year of publication, has published 150 articles, is accessed daily without cost by nearly 1,000 persons (the other three paper journals in this field have average total subscription bases of fewer than 1,000 persons), and has an average "lag" from submission to publication of about three weeks. Moreover, we have just this year started accepting articles in both English and Spanish. And all of this has been accomplished without funds other than the time I put into it as part of my normal job: no secretaries, no graduate assistants, nothing but a day or two a week of my time.

Two years ago, we adopted the policy that anyone publishing a quantitative study in the journal would have to agree to archive all the raw data at the journal website so that the data could be downloaded by any reader. Our authors have done so with enthusiasm. I think that you can see how this capability puts an entirely new face on the problem of how we integrate research findings: no more inaccurate conversions of inferential test statistics into something worth knowing like an effect size or a correlation coefficient or an odds ratio; no more speculating about distribution shapes; no more frustration at not knowing what violence has been committed when linear coefficients mask curvilinear relationships. Now we simply download each other's data, and the synthesis prize goes to the person who best assembles the pieces of the jigsaw puzzle into a coherent picture of how the variables relate to each other.

References

Cook, T.D., et al. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.

Cooper, H.M. (1989). Integrating research: a guide for literature reviews. 2nd ed. Newbury Park, CA: SAGE Publications.

Cooper, H.M. and Hedges, L. V. (Eds.) (1994). The handbook of research synthesis. New York: Russell Sage Foundation.

Cronbach, L.J. (1982). Prudent Aspirations for Social Inquiry. Chapter 5 (Pp. 61-81) in Kruskal, W.H. (Ed.), The social sciences: Their nature and uses. Chicago: The University of Chicago Press.

Eysenck, H.J. (1965). The effects of psychotherapy. International Journal of Psychiatry, 1, 97-178.

Glass, G. V (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.

Glass, G.V (1978). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-379.

Glass, G. V, McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: SAGE Publications.

Glass, G.V et al. (1982). School class size: Research and policy. Beverly Hills, CA: SAGE Publications.

Glass, G.V and Robbins, M.P. (1967). A critique of experiments on the role of neurological organization in reading performance. Reading Research Quarterly, 3, 5-51.

Hedges, L. V., Laine, R. D., & Greenwald, R. (1994). Does Money Matter? A Meta- Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes. Educational Researcher, 23(3): 5-14.

Hedges, L.V. and Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.

Hunt, M. (1997). How science takes stock: The story of meta-analysis. NY: Russell Sage Foundation.

Hunter, J.E. & Schmidt, F.L. (1990). Methods of meta-analysis: correcting error and bias in research findings. Newbury Park, CA: SAGE Publications.

Hunter, J.E., Schmidt, F.L. & Jackson, G.B. (1982). Meta-analysis: cumulating research findings across studies. Beverly Hills, CA: SAGE Publications.

Light, R. J., Singer, J. D., & Willett, J. B. (1990). By design: Planning research on higher education. Cambridge, MA: Harvard University Press.

Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-15.

Meehl. P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-34.

Rosenthal, R. (1976). Experimenter effects in behavioral research. New York: John Wiley.

Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed.). Newbury Park, CA: SAGE Publications.

Rubin, D. (1990). A new perspective. Chp. 14 (pp. 155-166) in Wachter, K. W. & Straf, M. L. (Eds.). The future of meta-analysis. New York: Russell Sage Foundation.

Smith, M.L. and Glass, G.V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-60.

Smith, M.L., Glass, G.V and Miller, T.I. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.

Underwood, B.J. (1957). Interference and forgetting. Psychological Review, 64(1), 49–60.

Wachter, K.W. and Straf, M.L., (Editors). (1990). The future of meta-analysis. New York: Russell Sage Foundation.

Wolf, F.M. (1986). Meta-analysis: quantitative methods for research synthesis. Beverly Hills, CA: SAGE Publications.

Thursday, August 20, 2020

Why Bother Testing in 2021?

Gene V Glass
David C. Berliner

At a recent Education Writers Association seminar, Jim Blew, an assistant to Betsy DeVos at the Department of Education, opined that the Department is inclined not to grant waivers to states seeking exemptions from the federally mandated annual standardized achievement testing. States like Michigan, Georgia, and South Carolina were seeking a one-year moratorium. Blew insisted that “even during a pandemic [tests] serve as an important tool in our education system.” He said that the Department’s “instinct” was to grant no waivers. Which system he was referring to, and important to whom, are two questions we seek to unravel here.

Without question, the “system” of the U.S. Department of Education has a huge stake in enforcing annual achievement testing. It’s not just that the Department’s relationship with Pearson Education, the U.K. corporation that is the major contractor for state testing, with annual revenues of nearly $5 billion, is at stake. The Department’s image as a “get tough” defender of high standards is also at stake. Pandemic be damned! We can’t let those weak-kneed blue states get away with covering up the incompetence of those teacher unions.

To whom are the results of this annual testing important? Governors? District superintendents? Teachers?

How the governors feel about the test results depends entirely on where they stand on the political spectrum. Blue state governors praise the findings when they are above the national average, and they call for increased funding when they are below. Red state governors, whose states’ scores are generally below average, insist that the results are a clear call for vouchers and more charter schools – in a word, choice. District administrators and teachers live in fear that they will be blamed for bad scores; and they will.

Fortunately, all the drama and politicking about the annual testing is utterly unnecessary. Last year’s district, or even schoolhouse, average almost perfectly predicts this year’s average. Give us the average Reading score for Grade Three for any medium-sized or larger district for last year and we’ll give you the average for this year within a point or two. So at the very least, testing every year is a waste of time and money – money that might ultimately help cover the salary of executives like John Fallon, Pearson Education CEO, whose total compensation in 2017 was more than $4 million.

But we wouldn’t even need to bother looking up a district’s test scores from last year to know where its achievement scores are this year. We can accurately predict those scores from data that cost nothing. It is well known, and has been for many years – just Google “Karl R. White” 1982 – that a school’s average socio-economic status (SES) is an accurate predictor of its achievement test average. “Accurate” here means a correlation exceeding .80. Even though a school’s racial composition overlaps considerably with the average wealth of the families it serves, adding Race to the prediction equation will still improve the prediction of test performance. Together, SES and Race tell us much about what is actually going on in the school lives of children: the years of experience of their teachers; the quality of the teaching materials and equipment; even the condition of the building they attend.

Don’t believe it? Think about this. The free and reduced lunch rate (FRL) at the 42 largest high schools in Nebraska was correlated with the school’s average score in Reading, Math, and Science on the Nebraska State Assessments. The correlations obtained were FRL & Reading r = -.93, FRL & Science r = -.94, and FRL & Math r = -.92. Correlation coefficients can’t get higher than 1.00.
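The statistic behind these claims is the ordinary Pearson correlation. Here is a minimal sketch of the computation, using made-up numbers for six hypothetical schools – not the actual Nebraska data:

```python
# Pearson correlation between a school's free/reduced lunch (FRL) rate
# and its mean reading score. Hypothetical figures for illustration only.
from statistics import mean, pstdev

frl = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]   # fraction of students on FRL
reading = [232, 221, 214, 203, 196, 188]     # school mean reading score

def pearson_r(x, y):
    """r = cov(x, y) / (sd(x) * sd(y)), using population moments."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

r = pearson_r(frl, reading)
print(round(r, 3))  # strongly negative: higher poverty, lower scores
```

With a coefficient that close to -1, knowing a school’s FRL rate is nearly as good as knowing its test average.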

If you can know a school’s test scores from its poverty rate, why give the test?

In fact, Chris Tienken answered that very question in New Jersey. With data on household income, the percentage of single-parent households, and parent education level in each township, he predicted a township’s rate of scoring “proficient” on the New Jersey state assessment. In Maple Shade Township, 48.71% of the students were predicted to be proficient in Language Arts; the actual proficiency rate was 48.70%. In Mount Arlington Township, 61.4% were predicted proficient; 61.5% were actually proficient. And so it went. Demographics may not be destiny for individuals, but when you want a reliable, quick, inexpensive estimate of how a school, township, or district will score on a standardized achievement test, demographics really are destiny – and will remain so until governments at many levels get serious about addressing the inequities holding back poor and minority schools.
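Tienken’s procedure is ordinary regression: fit a prediction equation on demographic inputs, then compare predicted to actual proficiency rates. A stripped-down sketch with a single predictor (his model used several), fitted on hypothetical townships rather than the New Jersey data:

```python
# One-variable least-squares sketch of the Tienken-style prediction:
# proficiency rate predicted from median household income alone.
# All numbers are hypothetical illustrations.
from statistics import mean

income = [42, 55, 61, 73, 88, 95]                   # median income ($1000s)
proficient = [39.0, 47.5, 52.0, 58.5, 68.0, 71.5]   # % scoring proficient

mx, my = mean(income), mean(proficient)
slope = sum((x - mx) * (y - my) for x, y in zip(income, proficient)) \
        / sum((x - mx) ** 2 for x in income)
intercept = my - slope * mx

def predict(x):
    return intercept + slope * x

for x, y in zip(income, proficient):
    print(f"income {x}k: predicted {predict(x):.1f}%, actual {y}%")
```

When the fit is this tight, the “actual” column adds almost nothing to what the demographic column already told you – which is the point of the Maple Shade and Mount Arlington examples.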

There is one more point to consider here: a school can more easily “fake” its achievement scores than it can fake its SES and racial composition. Test scores can be artificially raised by paying a test prep company, by giving just a tiny bit more time on the test, by looking the other way as students whip out their cell phones during the test, by looking at the test beforehand and sharing some “ideas” with students about how they might do better, or by examining the tests after they are given and changing an answer or two here and there. These are not hypothetical examples; they go on all the time.

However, don’t the principals and superintendents need the test data to determine which teachers are teaching well and which ones ought to be fired? That seems logical, but it doesn’t work. Our colleague Audrey Amrein-Beardsley and her students have addressed this issue in detail on the blog VAMboozled. In just one study, a Houston teacher was compared to other teachers in other schools sixteen different times over four years. Her students’ test scores indicated that she was better than the other teachers 8 times and worse than the others 8 times. So, do achievement tests tell us whether we have identified a great teacher, or a bad teacher? Or do the tests merely reveal who was in that teacher’s class that particular year? Again, the makeup of the class – demographics like social class, ethnicity, and native language – are powerful determiners of test scores.

But wait. Don’t the teachers need the state standardized test results to know how well their students are learning, what they know and what is still to be learned? Not at all. By Christmas, but certainly by springtime when most of the standardized tests are given, teachers can accurately tell you how their students will rank on those tests. Just ask them! Furthermore, they almost never get the information about their students’ achievement until the fall after the year they had those students in class, making the information value of the tests nil. In a pilot study by our former ASU student Annapurna Ganesh, a dozen 2nd and 3rd grade teachers ranked their children in terms of their likely scores on the upcoming Arizona state tests. Correlations were uniformly high – as high in one class as +.96! In a follow-up study of reading and mathematics, here are the correlations found for 8 of the third-grade teachers who predicted the ranking of their students on that year’s state of Arizona standardized test:

             Number of    Rank Order     Rank Order
             Students    Correlation,   Correlation,
                           Reading      Mathematics
Teacher A:      26          .907            .867
Teacher B:      24          .950            .855
Teacher C:      22          .924            .801
Teacher D:      23          .940            .899
Teacher E:      23          .891            .831
Teacher F:      27          .893            .873
Teacher G:      24          .890            .835
Teacher H:      24          .895            .837
For the larger sample, the lowest rank-order coefficient between a teacher’s ranking of the students and the students’ ranking on the state Math test was +.72! Berliner took these results to the Arizona Department of Education, informing them that they could get the information they wanted about how children are doing in about 10 minutes and for no money. He was told that he was “lying” and shown out of the office. The abuse must go on. Contracts must be honored.
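The rank-order coefficients in the table are Spearman correlations; with no tied ranks they reduce to a simple formula on the squared differences between the two rankings. A minimal sketch with hypothetical rankings for eight students – not data from the Ganesh study:

```python
# Spearman rank-order correlation, no-ties formula:
#   rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
# Rankings below are hypothetical illustrations.
teacher_rank = [1, 2, 3, 4, 5, 6, 7, 8]   # teacher's predicted ranking
test_rank    = [1, 3, 2, 4, 6, 5, 7, 8]   # ranking on the state test

def spearman_rho(r1, r2):
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(round(spearman_rho(teacher_rank, test_rank), 3))  # -> 0.952
```

A teacher who flips only a few adjacent pairs of students still lands in the +.9 range seen in the table – close agreement is easy to reach when teachers already know their students.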

Predicting rank can’t tell you the national percentile of this child or that, but that information is irrelevant to teachers anyway. Teachers usually know which child is struggling, which is soaring, and what both of them need. That is really the information they need!

Thus far, as we argue against the desire of our federal Department of Education to reinstitute achievement testing in each state, we have neglected to mention a test’s most important characteristic: its validity. We mention here, briefly, just one type of validity, content validity. For a test to have content validity, students in each state have to be exposed to, and taught, the curriculum for which the test is appropriate. The U.S. Department of Education seems not to have noticed that since March 2020 public schooling has been in a bit of an upheaval! The assumption that each district, in each state, has provided equal access to the curriculum on which a state’s test is based is shaky under normal circumstances. In a pandemic it is a remarkably stupid assumption. We assert that no state achievement test will be content valid if given in the 2020-2021 school year. Furthermore, those who help in administering and analyzing such tests are likely in violation of the testing standards of the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education. In addition to our other concerns with state standardized tests, there is no defensible use of an invalid test. Period.

We are not opposed to all testing, just to stupid testing. The National Assessment Governing Board voted 12 to 10 in favor of administering NAEP in 2021. There is some sense to doing so. NAEP tests fewer than 1 in 1,000 students in grades 4, 8, and 12. As a valid longitudinal measure, the results could tell us the extent of the devastation wrought by the coronavirus.

We end with some good news. The DeVos Department of Education position on Spring 2021 testing is likely to be utterly irrelevant. She and assistant Blew are likely to be watching the operation of the Department of Education from the sidelines after January 2021. We can only hope that members of a new administration read this and understand that some of the desperately needed money for American public schools can come from the huge federal budget for standardized testing. Because in seeking the answer to the question “Why bother testing in 2021?” we have necessarily confronted the more important question: “Why ever bother to administer these mandated tests?”

We hasten to add that we are not alone in this opinion. Among measurement experts competent to opine on such things, our colleagues at the National Education Policy Center likewise question the wisdom of federally mandated testing in 2021.

Gene V Glass
David C. Berliner
Arizona State University
National Education Policy Center
University of Colorado Boulder
San Jose State University

The opinions expressed here are those of the authors and do not represent the official position of the National Education Policy Center, Arizona State University, the University of Colorado Boulder, or San Jose State University.