My Last Day on Earth as a "Quantoid"
Gene V Glass
I was taught early in my professional career that personal recollections were not proper stuff for academic discourse. The teacher was my graduate adviser Julian Stanley, and the occasion was the 1963 Annual Meeting of the American Educational Research Association. Walter Cook, of the University of Minnesota, had finished delivering his AERA presidential address. Cook had a few things to say about education, but he had used the opportunity to thank a number of personal friends for their contribution to his life and career, including Nate Gage and Nate's wife; he had spoken of family picnics with the Gages and other professional friends. Afterwards, Julian and Ellis Page and a few of us graduate students were huddled in a cocktail party listening to Julian's post mortem of the presidential remarks. He made it clear that such personal reminiscences on such an occasion were out of place, not to be indulged in. The lesson was clear, but I have been unable to desist from indulging my own predilection for personal memories in professional presentations. But that early lesson has not been forgotten. It remains as a tug on conscience from a hidden teacher, a twinge that says "You should not be doing this," whenever I transgress.
Bob Stake and I and Tom Green and Ralph Tyler (to name only four) come from a tiny quadrilateral no more than 30 miles on any side in Southeastern Nebraska, a fertile crescent (with a strong gradient trailing off to the northeast) that reaches from Adams to Bethany to South Lincoln to Crete, a mesopotamia between the Nemaha and the Blue Rivers that had no more than 100,000 population before WW II. I met Ralph Tyler only once or twice, and both times it was far from Nebraska. Tom Green and I have a relationship conducted entirely by email; we have never met face-to-face. But Bob Stake and I go back a long way.
On a warm autumn afternoon in 1960, I was walking across campus at the University of Nebraska headed for Love Library and, as it turned out, walking by chance into my own future. I bumped into Virginia Hubka, a young woman of 19 at the time, with whom I had grown up since the age of 10 or 11. We seldom saw each other on campus. She was an Education major, and I was studying math and German with prospects of becoming a foreign language teacher in a small town in Nebraska. I had been married for two years at that time and felt a chronic need of money that was being met by janitorial work. Ginny told me of a job for a computer programmer that had just been advertised in the Ed Psych Department where she worked part time as a typist. A new faculty member—just two years out of Princeton with a shiny new PhD in Psychometrics—by the name of Bob Stake had received a government grant to do research.
I looked up Stake and found a young man scarcely ten years my senior with a remarkably athletic-looking body for a professor. He was willing to hire a complete stranger as a computer programmer on his project, though the applicant admitted that he had never seen a computer (few had in those days). The project was a Monte Carlo simulation of sampling distributions of latent roots of the B* matrix in multi-dimensional scaling—which may shock latter-day admirers of Bob's qualitative contributions. Stake was then a confirmed "quantoid" (n., devotee of quantitative methods, statistics geek). I took a workshop and learned to program a Burroughs 205 computer (competitor with the IBM 650); the 205 took up an entire floor of Nebraska Hall, which had to have special air conditioning installed to accommodate the heat generated by the behemoth. My job was to take randomly generated judgmental data matrices and convert them into a matrix of cosines of angles of separation among vectors representing stimulus objects. It took me six months to create and test the program; on today's equipment, it would require a few hours. Bob took over the resulting matrix and extracted latent roots to be compiled into empirical sampling distributions.
The work was in the tradition of metric scaling invented by Thurstone and generalized to the multidimensional case by Richardson and Torgerson and others; it was heady stuff. I was allowed to operate the computer in the middle of the night, bringing it up and shutting it down by myself. Bob found an office for me to share with a couple of graduate students in Ed Psych. I couldn't believe my good luck; from scrubbing floors to programming computers almost overnight. I can recall virtually every detail of those two years I spent working for Bob, first on the MDS project, then on a few other research projects he was conducting (even creating Skinnerian-type programmed instruction for a study of learner activity; my assignment was to program instruction in the Dewey Decimal system).
Stake was an attractive and fascinating figure to a young man who had never in his 20 years on earth traveled farther than 100 miles from his birthplace. He drove a Chevy station wagon, dusty rose and silver. He lived on the south side of Lincoln, a universe away from the lower-middle class neighborhoods of my side of town. He had a beautiful wife and two quiet, intense young boys who hung around his office on Saturdays silently playing games with paper and pencil. In the summer of 1961, I was invited to the Stakes' house for a barbecue. Several graduate students were there (Chris Buethe, Jim Beaird, Doug Sjogren). The backyard grass was long and needed mowing; in the middle of the yard was a huge letter "S" carved by a lawn mower. I imagined Bernadine having said once too often, "Bob, would you please mow the backyard?" (Bob's children tell me that he was accustomed to mowing mazes in the yard and inventing games for them that involved playing tag without leaving the paths.)
That summer, Bob invited me to drive with him to New York City to attend the ETS Invitational Testing Conference. Bob's mother would go with us. Mrs. Stake was a pillar of the small community, Adams, 25 miles south of Lincoln where Bob was born and raised. She regularly spoke at auxiliary meetings and other occasions about the United Nations, then only 15 years old. The trip to New York would give her a chance to renew her experiences and pick up more literature for her talks. Taking me along as a spare driver on a 3,500 mile car trip may not have been a completely selfless act on Bob's part, but going out of the way to visit the University of Wisconsin so that I could meet Julian Stanley and learn about graduate school definitely was generous. Bob had been corresponding with Julian since the Spring of 1961. The latter had written his colleagues around the country urging them to test promising young students of their acquaintance and send him any information about high scores. In those pre-GRE days, the Miller Analogies Test and the Doppelt Mathematical Reasoning Test were the instruments of choice. Julian was eager to discover young, high scorers and accelerate them through a doctoral program, thus preventing for them his own misfortune of having wasted four of his best years in an ammunition dump in North Africa during WW II—and presaging his later efforts to identify math prodigies in middle school and accelerate them through college. Bob had created his own mental ability test, named with the clever pun QED, the Quantitative Evaluative Device. Bob asked me to take all three tests; I loved taking them. He sent the scores to Julian, and subsequently the stop in Madison was arranged. Bob had made it clear that I should not attend graduate school in Lincoln.
We drove out of Lincoln—the professor, the bumpkin and Adams's Ambassador to the U.N.—on October 27, 1961. Our first stop was Platteville, Wisconsin, where we spent the night with Bill Jensen, a former student of Bob's from Nebraska. Throughout the trip we were never far from Bob's former students who seemed to feel privileged to host his retinue. On day two, we met Julian in Madison and had lunch at the Union beside Lake Mendota with him and Les McLean and Dave Wiley. The company was intimidating; I was certain that I did not fit in and that Lincoln was the only graduate school I was fit for. We spent the third night sleeping in the attic apartment of Jim Beaird, whose dissertation that spring was a piece of the Stake MDS project; he had just started his first academic job at the University of Toledo. The fourth day took us through the Allegheny Mountains in late October; the oak forests were yellow, orange and crimson, so unlike my native savanna. We shared the driving. Bob drove through rural New Jersey searching for the small community where his brother Don lived; he had arranged to drop off his mother there. The maze was negotiated without the aid of road maps or other prostheses; indeed, none was consulted during the entire ten days. That night was spent in Princeton. Fred Kling, a former ETS Princeton Psychometric Fellow at Princeton with Bob, and his wife entertained us with a spaghetti dinner by candlelight. It was the first time in my life I had seen candles on a dinner table other than during a power outage, as it was also the first time I had tasted spaghetti not out of a can.
The next day we called on Harold Gulliksen at his home. Gulliksen had been Bob's adviser at Princeton. We were greeted by his wife, who showed us to a small room outside his home office. We waited a few minutes while he disengaged from some strenuous mental occupation. Gulliksen swept into the room wearing white shirt and tie; he shook my hand when introduced; he focused on Bob's MDS research. The audience was over within fifteen minutes. I didn't want to return to Princeton.
We drove out to the ETS campus. Bob may have been gone for three years, but he was obviously not forgotten. Secretaries in particular seemed happy to see him. Bob was looking for Sam Messick. I was overwhelmed to see that these citations—(Abelson and Messick, 1958)—were actual persons, not like anything I had ever seen in Nebraska of course, but actual living, breathing human beings in whose presence one could remain for several minutes without something disastrous happening. Bob reported briefly on our MDS project to Messick. Sam had a manuscript in front of him on his desk. "Well, it may be beside the point," Messick replied to Bob's description of our findings. He held up the manuscript. It was a pre-publication draft of Roger Shepard's "Analysis of Proximities," which was to revolutionize multidimensional scaling and render our Monte Carlo study obsolete. It was October 30, 1961. It was Bob Stake's last day on earth as a quantoid.
The ETS Invitational Testing Conference was held in the Roosevelt Hotel in Manhattan. We bunked with Hans Steffan in East Orange and took the tube to Manhattan. Hans had been another Stake student; he was a native German and I took the opportunity to practice my textbook Deutsch. I will spare the reader a 21-year-old Nebraska boy's impressions of Manhattan, all too shopworn to bear repeating. The Conference was filled with more walking citations: Bob Ebel, Ledyard Tucker, E. F. Lindquist, Ted Cureton, famous name after famous name. (Ten years later, I had the honor of chairing the ETS Conference, which gave me the opportunity to pick the roster of speakers along with ETS staff. I asked Bob to present his ideas on assessment; he gave a talk about National Assessment that featured a short film that he had made. People remarked that they were not certain that he was being "serious." His predictions about NAEP were remarkably prescient.)
We picked up Bob's mother in Harrisburg, Pennsylvania, for some reason now forgotten. While we had listened to papers, she had invaded and taken over the U.N. We pointed the station wagon west; we made one stop in Toledo to sleep for a few hours. I did more than my share behind the wheel. I was extremely tired, having not slept well in New York. Bob and I usually slept in the same double bed on this trip and I was too worried about committing some gross act in my sleep to rest comfortably. I had a hard time staying awake during my stints at the wheel, but I would not betray weakness by asking for relief. I nearly fell asleep several times through Ohio, risking snuffing out two promising academic careers and breaking Adams, Nebraska's only diplomatic tie to the United Nations.
To help relieve the boredom of the long return trip, Bob and I played a word game that he had learned or invented. It was called "Ghost." Player one thinks of a five-letter word, say "spice." Player two guesses a five-letter word to start; suppose I guessed "steam." Player one superimposes, in his mind, the target word "spice" and my first guess "steam" and sees that one letter coincides—the "s." Since one letter is an odd number of letters, he replies "odd." If no letters coincide he says "even." If I had been very lucky—actually unlucky—and first guessed "slice," player one would reply "even" because four letters coincide. (This would actually have been an unlucky start since one reasonably assumes that the initial response "even" means that zero letters coincide. I think that games of this heinous intricacy are not unknown to Stake children.) Through a process of guessing words and deducing coincidences from "odd" and "even" responses, player two eventually discovers player one's word. It is a difficult game and it can consume hundreds of miles on the road. Several rounds of the game took us through Ohio, Indiana, Illinois. Somewhere around the Quad Cities, Bob played his trump card. He was thinking of a word that resisted all my most assiduous attempts at deciphering. Finally, outside Omaha I conceded defeat. His word was "ouija," as in the board. Do we take this incident as in some way a measure of this man?
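For readers who want to try the game, the reply rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the tiny word list are invented for the example, and the narrowing step is one plausible way for player two to reason, not a record of how Bob or I actually played.

```python
def coinciding_letters(target: str, guess: str) -> int:
    """Count the positions at which the two words carry the same letter."""
    if len(target) != len(guess):
        raise ValueError("both words must have the same length")
    return sum(t == g for t, g in zip(target.lower(), guess.lower()))


def ghost_reply(target: str, guess: str) -> str:
    """Player one's reply: the parity of the number of coinciding letters."""
    return "odd" if coinciding_letters(target, guess) % 2 else "even"


def consistent_words(candidates, history):
    """Keep only the candidates that would have produced every reply so far.

    `history` is a list of (guess, reply) pairs; `candidates` is a
    hypothetical word list that player two carries in his head.
    """
    return [
        word
        for word in candidates
        if all(ghost_reply(word, guess) == reply for guess, reply in history)
    ]


# The two cases from the text, with "spice" as the target word.
print(ghost_reply("spice", "steam"))  # one coinciding letter, the "s"
print(ghost_reply("spice", "slice"))  # four coinciding letters

# Narrowing a small, made-up word list after two replies.
words = ["spice", "slice", "steam", "ouija"]
print(consistent_words(words, [("steam", "odd"), ("slice", "even")]))
```

The first call replies "odd" and the second "even", matching the two cases above. Because a reply of "even" cannot distinguish zero coinciding letters from two or four, each guess eliminates far fewer words than one might hope, which is part of why a single round can last across several states.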
By the time I arrived in Lincoln, a Western Union Telegram from Julian was waiting. I had never before received a telegram—or known anyone who had. I was flattered; I was hooked. Three months later, January 1962, I left Lincoln, Stake and everything I had known my entire life for graduate school. Bob and I corresponded regularly during the ensuing years. He wrote to tell me that he had taken a job at Urbana. I told him I was learning all that was known about statistics. He wrote several times during his summer, 1964, at Stanford in the institute that Lee Cronbach and Richard Atkinson conducted. Clearly it was a transforming experience for him. I was jealous. When I finished my degree in 1965, Bob had engineered a position for me in CIRCE at Univ. of Illinois. I was there when Bob wrote his "Countenance" paper; I pretended to understand it. I learned that there was a world beyond statistics; Bob had undergone enormous changes intellectually since our MDS days. I admired them, even as I recognized my own inability to follow. I spent two years at CIRCE; I think I felt the need to shine my own light away from the long shadows. I picked a place where I thought I might shine: Colorado.
Bob and I saw very little of each other from 1967 on. In the early 1970s, I invited him to teach summer school at Boulder. He gave a seminar on evaluation and converted all my graduate students into Stake-ians. But I saw little of him that summer. We didn't connect again until 1978.
When the year 1978 arrived, I was at the absolute height of my powers as a quantoid. My book on time-series experiment analysis was being reviewed by generous souls who called it a "watershed." Meta-analysis was raging through the social and behavioral sciences. I had nearly completed the class-size meta-analysis. The Hastings Symposium, on the occasion of Tom Hastings's retirement as head of CIRCE, was happening in Urbana in January. I attended. Lee Cronbach delivered a brilliant paper that gradually metamorphosed into his classic Designing Evaluations of Educational and Social Programs. Lee argued that the place of controlled experiments in educational evaluation is much less than we had once imagined. "External validity," if we must call it that, is far more important than "internal validity," which is after all not just an impossibility but a triviality. Experimental validity cannot be reduced to a catechism. Well, this cut to the heart of my quantoid ideology, and I remember rising during the discussion of Lee's paper to remind him that controlled, randomized experiments worked perfectly well in clinical drug trials. He thanked me for divulging this remarkable piece of intelligence.
That summer I visited Eva Baker's Center for the Study of Evaluation at UCLA for eight weeks. Bob came for two weeks at Eva's invitation. One day he dropped a sheet of paper on my desk that contained only these words:
I was a quantoid, and "what I do best" was peaking. I gave a colloquium at Eva's center on the class size meta-analysis in mid-June. People were amazed. Jim Popham asked for the paper to inaugurate his new journal Educational Evaluation and Policy Analysis. He was welcome to it.
June 30, 1978, dawned inauspiciously; I had no warning that it would be my last day on earth as a quantoid. Bob was to speak at a colloquium at the Center on whatever it was that was on his mind at that moment. Ernie House was visiting from Urbana. I was looking forward to the talk, because Bob never gave a dull lecture in his life. That day he talked about portrayal, complexity, understanding; qualities that are not yet nor may never be quantities; the ineffable (Bob has never been a big fan of the "effable"). I listened with respect and admiration, but I listened as one might listen to stories about strange foreign lands, about something that was interesting but that bore no relationship to one's own life. Near the end when questions were being asked I sought to clarify the boundaries that contained Bob's curious thoughts. I asked, "Just to clarify, Bob, between an experimentalist evaluator and a school person with intimate knowledge of the program in question, who would you trust to produce the most reliable knowledge of the program's efficacy?" I sat back confident that I had shown Bob his proper place in evaluation—that he couldn't really claim to assess impact, efficacy, cause-and-effect with his case-study, qualitative methods—and waited for his response, which came with uncharacteristic alacrity. "The school person," he said. I was stunned. Here was a person I respected without qualification whose intelligence I had long admired who was seeing the world far differently from how I saw it.
Bob and Ernie and I stayed long after the colloquium arguing about Bob's answer, rather Ernie and I argued vociferously while Bob occasionally interjected a word or sentence of clarification. I insisted that causes could only be known (discovered, found, verified) by randomized, controlled experiments with double-blinding and followed up with statistical significance tests. Ernie and Bob argued that even if you could bring off such an improbable event as the experiment I described, you still wouldn't know what caused a desirable outcome in a particular venue. I couldn't believe what they were saying; I heard it, but I thought they were playing Jesuitical games with words. Was this Bob's ghost game again?
Eventually, after at least an hour's heated discussion I started to see Bob and Ernie's point. Knowledge of a "cause" in education is not something that automatically results from one of my ideal experiments. Even if my experiment could produce the "cause" of a wonderful educational program, it would remain for those who would share knowledge of that cause with others to describe it to them, or act it out while they watched, or somehow communicate the actions, conditions and circumstances that constitute the "cause" that produces the desired effect. They—Bob and Ernie—saw the experimenter as not trained, not capable of the most important step in the chain: conveying to others a sense of what works and how to bring it about. "Knowing" what caused the success is easier, they believed, than "portraying" to others a sense for what is known.
I cannot tell you, dear reader, why I was at that moment prepared to accept their belief and their arguments, but I was. What they said in that hour after Bob's colloquium suddenly struck me as true. And in the weeks and months after that exchange in Moore Hall at UCLA, I came to believe what they believed about studying education and evaluating schools: many people can know causes; few experiments can clarify causal claims; telling others what we know is the harder part. It was my last day on earth as a quantoid.
In the early 1970s, Bob introduced me to the writings of another son of Lincoln, Loren Eiseley, the anthropologist, academic and author, whom Wystan H. Auden once named as one of the leading poets of his generation. Eiseley wrote often about his experiences in the classroom; he wrote of "hidden teachers," who touch our lives and never leave us, who speak softly at the back of our minds, who say "Do this; don’t do that."
In his book The Invisible Pyramid, Eiseley wrote of "The Last Magician." "Every man in his youth—and who is to say when youth is ended?—meets for the last time a magician, a man who made him what he is finally to be." (p. 137) For Eiseley, that last magician is no secret to those who have read his autobiography, All the Strange Hours; he was Frank Speck, an anthropology professor at the University of Pennsylvania who was Eiseley's adviser, then colleague, and to whose endowed chair Eiseley succeeded upon Speck's retirement. (It is a curious coincidence that all Freudians will love that Eiseley's first published book was a biography of Francis Bacon entitled The Man Who Saw Through Time; Francis Bacon and Frank Speck are English and German translations of each other.)
Eiseley described his encounter with the ghost of his last magician:
"I was fifty years old when my youth ended, and it was, of all unlikely places, within that great unwieldy structure built to last forever and then hastily to be torn down—the Pennsylvania Station in New York. I had come in through a side doorway and was slowly descending a great staircase in a slanting shaft of afternoon sunlight. Distantly I became aware of a man loitering at the bottom of the steps, as though awaiting me there. As I descended he swung about and began climbing toward me."
Eiseley had seen a ghost. His mind fixed on the terror he felt at encountering Speck's ghost. They had been friends. Why had he felt afraid?
"On the slow train running homeward the answer came. I had been away for ten years from the forest. I had had no messages from its depths.... I had been immersed in the postwar administrative life of a growing university. But all the time some accusing spirit, the familiar of the last wood-struck magician, had lingered in my brain. Finally exteriorized, he had stridden up the stair to confront me in the autumn light. Whether he had been imposed in some fashion upon a convenient facsimile or was a genuine illusion was of little importance compared to the message he had brought. I had starved and betrayed myself. It was this that had brought the terror. For the first time in years I left my office in midafternoon and sought the sleeping silence of a nearby cemetery. I was as pale and drained as the Indian pipe plants without chlorophyll that rise after rains on the forest floor. It was time for a change. I wrote a letter and studied timetables. I was returning to the land that bore me." (p. 139)
Whenever I am at my worst—rash, hostile, refusing to listen, unwilling even to try to understand—something tugs at me from somewhere at the back of consciousness, asking me to be better than that, to be more like this person or that person I admire. Bob Stake and I are opposites on most dimensions that I can imagine. I form judgments prematurely; he is slow to judge. I am impetuous; he is reflective. I talk too much; perhaps he talks not enough. I change my persona every decade; his seemingly never changes. And yet, Bob has always been for me a hidden teacher.
This is the text of remarks delivered in part on the occasion of a symposium honoring the retirement of Robert E. Stake, University of Illinois—UC. May 9, 1998 in Urbana, Illinois.
Eiseley, Loren (1970). The Invisible Pyramid. New York: Scribner.
Thursday, July 21, 2022
Saturday, July 2, 2022
Meta-analysis at 25: A Personal History
Gene V Glass
Arizona State University
It has been nearly 25 years since meta-analysis, under that name and in its current guise, made its first appearance. I wish to avoid the weary references to the new century or millennium—depending on how apocalyptic you're feeling (besides, it's 5759 on my calendar anyway)—and simply point out that meta-analysis is at the age when most things graduate from college, so it's not too soon to ask what accounting can be made of it. I have refrained from publishing anything on the topic of the methods of meta-analysis since about 1980 out of a reluctance to lay some heavy hand on other people's enthusiasms and a wish to hide my cynicism from public view. Others have eagerly advanced its development and I'll get to their contributions shortly (Cooper & Hedges, 1994; Hedges and Olkin, 1985; Hunter, Schmidt and Jackson, 1982).
Autobiography may be the truest, most honest narrative, even if it risks self-aggrandizement, or worse, self-deception. Forgive me if I risk the latter for the sake of the former. For some reason it is increasingly difficult these days to speak in any other way.
In the span of this rather conventional paper, I wish to review the brief history of the form of quantitative research synthesis that is now generally known as "meta-analysis" (though I can't possibly recount this history as well as has Morton Hunt (1997) in his new book How Science Takes Stock: The Story of Meta-Analysis), tell where it came from, why it happened when it did, what was wrong with it and what remains to be done to make the findings of research in the social and behavioral sciences more understandable and useful.
In 25 years, meta-analysis has grown from an unheard of preoccupation of a very small group of statisticians working on problems of research integration in education and psychotherapy to a minor academic industry, as well as a commercial endeavor (see http://epidemiology.com/ and http://members.tripod.com/~Consulting_Unlimited/, for example). A keyword web search—the contemporary measure of visibility and impact—(Excite, January 28, 2000) on the word "meta-analysis" brings 2,200 "hits" of varying degrees of relevance, of course. About 25% of the articles in the Psychological Bulletin in the past several years have the term "meta-analysis" in the title. Its popularity in the social sciences and education is nothing compared to its influence in medicine, where literally hundreds of meta-analyses have been published in the past 20 years. (In fact, my internist quotes findings of what he identifies as published meta-analyses during my physical exams.) An ERIC search shows well over 1,500 articles on meta-analyses written since 1975.
Surely it is true that as far as meta-analysis is concerned, necessity was the mother of invention, and if it hadn't been invented—so to speak—in the early 1970s it would have been invented soon thereafter since the volume of research in many fields was growing at such a rate that traditional narrative approaches to summarizing and integrating research were beginning to break down. But still, the combination of circumstances that brought about meta-analysis in about 1975 may itself be interesting and revealing. There were three circumstances that influenced me.
The first was personal. I left the University of Wisconsin in 1965 with a brand new PhD in psychometrics and statistics and a major league neurosis—years in the making—that was increasingly making my life miserable. Luckily, I found my way into psychotherapy that year while on the faculty of the University of Illinois and never left it until eight years later while teaching at the University of Colorado. I was so impressed with the power of psychotherapy as a means of changing my life and making it better that by 1970 I was studying clinical psychology (with the help of a good friend and colleague Vic Raimy at Boulder) and looking for opportunities to gain experience doing therapy.
In spite of my personal enthusiasm for psychotherapy, the weight of academic opinion at that time derived from Hans Eysenck's frequent and tendentious reviews of the psychotherapy outcome research that proclaimed psychotherapy as worthless—a mere placebo, if that. I found this conclusion personally threatening—it called into question not only the preoccupation of about a decade of my life but my scholarly judgment (and the wisdom of having dropped a fair chunk of change) as well. I read Eysenck's literature reviews and was impressed primarily with their arbitrariness, idiosyncrasy and high-handed dismissiveness. I wanted to take on Eysenck and show that he was wrong: psychotherapy does change lives and make them better.
The second circumstance that prompted meta-analysis to come out when it did had to do with an obligation to give a speech. In 1974, I was elected President of the American Educational Research Association, in a peculiar miscarriage of the democratic process. This position is largely an honorific title that involves little more than chairing a few Association Council meetings and delivering a "presidential address" at the Annual Meeting. It's the "presidential address" that is the problem. No one I know who has served as AERA President really feels that they deserved the honor; the number of more worthy scholars passed over not only exceeds the number of recipients of the honor by several times, but as a group they probably outshine the few who were honored. Consequently, the need to prove one's worthiness to oneself and one's colleagues is nearly overwhelming, and the most public occasion on which to do it is the Presidential address, where one is assured of an audience of 1,500 or so of the world's top educational researchers. Not a few of my predecessors and contemporaries have cracked under this pressure and succumbed to the temptation to spin out grandiose fantasies about how educational research can become infallible or omnipotent, or about how government at national and world levels must be rebuilt to conform to the dictates of educational researchers. And so I approached the middle of the 1970s knowing that by April 1976 I was expected to release some bombast on the world that proved my worthiness for the AERA Presidency, and knowing that most such speeches were embarrassments spun out of feelings of intimidation and unworthiness. 
(A man named Richard Krech, I believe, won my undying respect when I was still in graduate school; having been distinguished by the American Psychological Association in the 1960s with one of its highest research awards, Krech, a professor at Berkeley, informed the Association that he was honored, but that he had nothing particularly new to report to the organization at the obligatory annual convention address, but if in the future he did have anything worth saying, they would hear it first.)
The third set of circumstances that joined my wish to annihilate Eysenck and prove that psychotherapy really works and my need to make a big splash with my Presidential Address was that my training under the likes of Julian Stanley, Chester Harris, Henry Kaiser and George E. P. Box at Wisconsin in statistics and experimental design had left me with a set of doubts and questions about how we were advancing the empirical agenda in educational research. In particular, I had learned to be very skeptical of statistical significance testing; I had learned that all research was imperfect in one respect or another (or, in other words, there are no "perfectly valid" studies nor any line that demarcates "valid" from "invalid" studies); and third, I was beginning to question a taken-for-granted assumption of our work that we progress toward truth by doing what everyone commonly refers to as "studies." (I know that these are complex issues that need to be thoroughly examined to be accurately communicated, and I shall try to return to them.) I recall two publications from graduate school days that impressed me considerably. One was a curve relating serial position of a list of items to be memorized to probability of correct recall that Benton Underwood (1957) had synthesized from a dozen or more published memory experiments. The other was a Psychological Bulletin article by Sandy Astin on the effects of glutamic acid on mental performance (whose results presaged a meta-analysis of the Feingold diet research 30 years later in that poorly controlled experiments showed benefits and well controlled experiments did not).
Permit me to say just a word or two about each of these studies because they very much influenced my thinking about how we should "review" research. Underwood had combined the findings of 16 experiments on serial learning to demonstrate a consistent geometrically decreasing curve describing the declining probability of correct recall as a function of number of previously memorized items, thus giving strong weight to an interference explanation of recall errors. What was interesting about Underwood's curve was that it was an amalgamation of studies that had different lengths of lists and different items to be recalled (nonsense syllables, baseball teams, colors and the like).
Astin's Psychological Bulletin review had attracted my attention in another respect. Glutamic acid—it will now scarcely be remembered—was a discovery of the 1950s that putatively increased the ability of tissue to absorb oxygen. Reasoning with the primitive constructs of the time, researchers hypothesized that more oxygen to the brain would produce more intelligent behavior. (It is not known what amount of oxygen was reaching the brains of the scientists proposing this hypothesis.) A series of experiments in the 1950s and 1960s tested glutamic acid against "control groups" and by 1961, Astin was able to array these findings in a crosstabulation that showed that the chances of finding a significant effect for glutamic acid were related (according to a chi-square test) to the presence or absence of various controls in the experiment; placebos and blinding of assessors, for example, were associated with no significant effect of the acid. As irrelevant as the chi-square test now seems, at the time I saw it done, it was revelatory to see "studies" being treated as data points in a statistical analysis. (In 1967, I attempted a similar approach while reviewing the experimental evidence on the Doman-Delacato pattern therapy; see Glass and Robbins, 1967.)
At about the same time I was reading Underwood and Astin, I certainly must have read Ben Bloom's Stability of Human Characteristics (1963), but its aggregated graphs of correlation coefficients made no impression on me, because it was many years after the work to be described below that I noticed a similarity between his approach and meta-analysis. Perhaps the connections were not made because Bloom dealt with variables such as age, weight, height, IQ and the like where the problems of dissimilarity of variables did not force one to worry about the kinds of problem that lie at the heart of meta-analysis.
If precedence is of any concern, Bob Rosenthal deserves as much credit as anyone for furthering what we now conveniently call "meta-analysis." In 1976, he published Experimenter Effects in Behavioral Research, which contained calculations of many "effect sizes" (i.e., standardized mean differences) that were then compared across domains or conditions. If Bob had just gone a little further in quantifying study characteristics and subjecting the whole business to regression analyses and what-not, and then thinking up a snappy name, it would be his name that came up every time the subject is research integration. But Bob had an even more positive influence on the development of meta-analysis than one would infer from his numerous methodological writings on the subject. When I was making my initial forays onto the battlefield of psychotherapy outcome research—about which more soon—Bob wrote me a very nice and encouraging letter in which he indicated that the approach we were taking made perfect sense. Of course, it ought to have made sense to him, considering that it was not that different from what he had done in Experimenter Effects. He probably doesn't realize how important that validation from a stranger was. (And while on the topic of snappy names, although people have suggested or promoted several polysyllabic alternatives—quantitative synthesis, statistical research integration—the name meta-analysis, suggested by Michael Scriven's meta-evaluation (meaning the evaluation of evaluations), appears to have caught on. To press on further into it, the "meta" comes from the Greek preposition meaning "behind" or "in back of." Its application as in "metaphysics" derives from the fact that in the publication of Aristotle's writings during the Middle Ages, the section dealing with the transcendental was bound immediately behind the section dealing with physics; lacking any title provided by its author, this final section became known as Aristotle's "metaphysics."
So, in fact, metaphysics is not some grander form of physics, some all encompassing, overarching general theory of everything; it is merely what Aristotle put after the stuff he wrote on physics. The point of this aside is to attempt to leach out of the term "meta-analysis" some of the grandiosity that others see in it. It is not the grand theory of research; it is simply a way of speaking of the statistical analysis of statistical analyses.)
So positioned in these circumstances, in the summer of 1974, I set about to do battle with Dr. Eysenck and prove that psychotherapy—my psychotherapy—was an effective treatment. (Incidentally, though it may be of only the merest passing interest, my preferences for psychotherapy are Freudian, a predilection that causes Ron Nelson and other of my ASU colleagues great distress, I'm sure.) I joined the battle with Eysenck's 1965 review of the psychotherapy outcome literature. Eysenck began his famous reviews by eliminating from consideration all theses, dissertations, project reports or other contemptible items not published in peer-reviewed journals. This arbitrary exclusion of literally hundreds of evaluations of therapy outcomes was indefensible. It's one thing to believe that peer review guarantees truth; it is quite another to believe that all truth appears in peer reviewed journals. (The most important paper on the multiple comparisons problem in ANOVA was distributed as an unpublished ditto manuscript from the Princeton University Mathematics Department by John Tukey; it never was published in a peer reviewed journal.)
Next, Eysenck eliminated any experiment that did not include an untreated control group. This makes no sense whatever, since head-to-head comparisons of two different types of psychotherapy contribute a great deal to our knowledge of psychotherapy effects. If a horse runs 20 mph faster than a man and 35 mph faster than a pig, I can conclude with confidence that the man will outrun the pig by 15 mph. Having winnowed a huge literature down to 11 studies (!) by whim and prejudice, Eysenck proceeded to describe their findings solely in terms of whether or not statistical significance was attained at the .05 level. No matter that the results may have barely missed the .05 level or soared beyond it. All that Eysenck considered worth noting about an experiment was whether the differences reached significance at the .05 level. If it reached significance at only the .07 level, Eysenck classified it as showing "no effect for psychotherapy."
Finally, Eysenck did something truly staggering in its illogic. If a study showed significant differences favoring therapy over control on what he regarded as a "subjective" measure of outcome (e.g., the Rorschach or the Thematic Apperception Test), he discounted the findings entirely. So be it; he may be a tough case, but that's his right. But then, when encountering a study that showed differences on an "objective" outcome measure (e.g., GPA) but no differences on a subjective measure (like the TAT), Eysenck discounted the entire study because the outcome differences were "inconsistent."
Looking back on it, I can almost credit Eysenck with the invention of meta-analysis by anti-thesis. By doing everything in the opposite way that he did, one would have been led straight to meta-analysis. Adopt an a posteriori attitude toward including studies in a synthesis, replace statistical significance by measures of strength of relationship or effect, and view the entire task of integration as a problem in data analysis where "studies" are quantified and the resulting data-base subjected to statistical analysis, and meta-analysis assumes its first formulation. (Thank you, Professor Eysenck.)
Working with my colleague Mary Lee Smith, I set about to collect all the psychotherapy outcome studies that could be found and subject them to this new form of analysis. By May of 1975, the results were ready to try out on a friendly group of colleagues. The May 12th Group had been meeting yearly since about 1968 to talk about problems in the area of program evaluation. The 1975 meeting was held in Tampa at Dick Jaeger's place. I worked up a brief handout and nervously gave my friends an account of the preliminary results of the psychotherapy meta-analysis. Lee Cronbach was there; so were Bob Stake, David Wiley, Les McLean and other trusted colleagues who could be relied on to demolish any foolishness they might see. To my immense relief they found the approach plausible or at least not obviously stupid. (I drew frequently in the future on that reassurance when others, whom I respected less, pronounced the entire business stupid.)
The first meta-analysis of the psychotherapy outcome research found that the typical therapy trial raised the treatment group to a level about two-thirds of a standard deviation on average above untreated controls; the average person receiving therapy finished the experiment in a position that exceeded the 75th percentile in the control group on whatever outcome measure happened to be taken. This finding summarized dozens of experiments encompassing a few thousand persons as subjects and must have been cold comfort to Professor Eysenck.
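The step from "two-thirds of a standard deviation" to "the 75th percentile" can be checked with a short calculation. The following is a minimal sketch, assuming that outcomes in both groups are normally distributed with equal variance; that assumption, and the function name, are illustrative and are not taken from the original analysis.

```python
from statistics import NormalDist


def control_percentile(effect_size: float) -> float:
    """Percentile of the control distribution reached by the average treated person.

    Assumes normal outcomes with equal variance in both groups, so the
    treated mean sits `effect_size` control standard deviations above
    the control mean.
    """
    return 100.0 * NormalDist().cdf(effect_size)


# An effect of two-thirds of a standard deviation, as described in the text.
two_thirds_sd = control_percentile(2 / 3)
print(f"{two_thirds_sd:.1f}")  # lands at roughly the 75th percentile
```

Under these assumptions an effect size of zero corresponds to the 50th percentile, so the percentile gives a more intuitive reading of the same quantity.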
An expansion and reworking of the psychotherapy experiments resulted in the paper that was delivered as the much feared AERA Presidential address in April 1976. Its reception was gratifying. Two months later a long version was presented at a meeting of psychotherapy researchers in San Diego. Their reactions foreshadowed the eventual reception of the work among psychologists. Some said that the work was revolutionary and proved what they had known all along; others said it was wrongheaded and meaningless. The widest publication of the work came in 1977, in a now, may I say, famous article by Smith and Glass in the American Psychologist. Eysenck responded to the article by calling it "mega-silliness," a moderately clever play on meta-analysis that nonetheless swayed few.
Psychologists tended to fixate on the fact that the analysis gave no warrant to any claims that one type or style of psychotherapy was any more effective than any other: whether called "behavioral" or "Rogerian" or "rational" or "psychodynamic," all the therapies seemed to work and to work to about the same degree of effectiveness. Behavior therapists, who had claimed victory in the psychotherapy horserace because they were "scientific" and others weren't, found this conclusion unacceptable and took it as reason enough to declare meta-analysis invalid. Non-behavioral therapists—the Rogerians, Adlerians and Freudians, to name a few—hailed the meta-analysis as one of the great achievements of psychological research: a "classic," a "watershed." My cynicism about research and much of psychology dates from approximately this period.
The first appearances of meta-analysis in the 1970s were not met universally with encomiums and expressions of gratitude. There was no shortage of critics who found the whole idea wrong-headed, senseless, misbegotten, etc.
The Apples-and-Oranges Problem
Of course the most often repeated criticism of meta-analysis was that it was meaningless because it "mixed apples and oranges." I was not unprepared for this criticism; indeed, I had long before prepared my own defense: "Of course it mixes apples and oranges; in the study of fruit nothing else is sensible; comparing apples and oranges is the only endeavor worthy of true scientists; comparing apples to apples is trivial." But I misjudged the degree to which this criticism would take hold of people's opinions and shut down their minds. At times I even began to entertain my own doubts that it made sense to integrate any two studies unless they were studies of "the same thing." But, the same persons who were arguing that no two studies should be compared unless they were studies of the "same thing," were blithely comparing persons (i.e., experimental "subjects") within their studies all the time. This seemed inconsistent. Plus, I had a glimmer of the self-contradictory nature of the statement "No two things can be compared unless they are the same." If they are the same, there is no reason to compare them; indeed, if "they" are the same, then there are not two things, there is only one thing and comparison is not an issue. And yet I had a gnawing insecurity that the critics might be right. One study is an apple, and a second study is an orange; and comparing them is as stupid as comparing apples and oranges, except that sometimes I do hesitate while considering whether I'm hungry for an apple or an orange.
At about this time—late 1970s—I was browsing through a new book that I had bought out of a vague sense that it might be worth my time because it was written by a Harvard philosopher, carried a title like Philosophical Explanations and was written by an author—Robert Nozick—who had written one of the few pieces on the philosophy of the social sciences that ever impressed me as being worth rereading. To my amazement, Nozick spent the first one hundred pages of his book on the problem of "identity," i.e., what does it mean to say that two things are the same? Starting with the puzzle of how two things that are alike in every respect would not be one thing, Nozick unraveled the problem of identity and discovered its fundamental nature underlying a host of philosophical questions ranging from "How do we think?" to "How do I know that I am I?" Here, I thought at last, might be the answer to the "apples and oranges" question. And indeed, it was there.
Nozick considered the classic problem of Theseus's ship. Theseus, King of Athens, and his men are plying the waters of the Mediterranean. Each day a sailor replaces a wooden plank in the ship. After nearly five years, every plank has been replaced. Are Theseus and his men still sailing in the same ship that was launched five years earlier on the Mediterranean? "Of course," most will answer. But suppose that as each original plank was removed, it was taken ashore and repositioned exactly as it had been on the waters, so that at the end of five years, there exists a ship on shore, every plank of which once stood in exactly the same relationship to every other in what five years earlier had been Theseus's ship. Is this ship on shore—which we could easily launch if we so chose—Theseus's ship? Or is the ship sailing the Mediterranean with all of its new planks the same ship that we originally regarded as Theseus's ship? The answer depends on what we understand the concept of "same" to mean.
Consider an even more troubling example that stems from the problem of the persistence of personal identity. How do I know that I am that person who I was yesterday, or last year, or twenty-five years ago? Why would an old high-school friend say that I am Gene Glass, even though hundreds, no, thousands of things about me have changed since high school? Probably no cells are in common between this organism and the organism that responded to the name "Gene Glass" forty years ago; I can assure you that there are few attitudes and thoughts held in common between these two organisms—or is it one organism? Why, then, would an old high-school friend, suitably prompted, say without hesitation, "Yes, this is Gene Glass, the same person I went to high school with"? Nozick argued that the only sense in which personal identity survives across time is in the sense of what he called "the closest related continuer." I am still recognized as Gene Glass to those who knew me then because I am that thing most closely related to that person to whom they applied the name "Gene Glass" over forty years ago. Now notice that implied in this concept of the "closest related continuer" are notions of distance and relationship. Nozick was quite clear that these concepts had to be given concrete definition to understand how in particular instances people use the concept of identity. In fact, to Nozick's way of thinking, things are compared by means of weighted functions of constituent factors, and their "distance" from each other is "calculated" in many instances in a Euclidean way.
Consider Theseus's ship again. Is the ship sailing the seas the "same" ship that Theseus launched five years earlier? Or is the ship on the shore made of all the original planks from that first ship the "same" as Theseus's original ship? If I give great weight to the materials and the length of time those materials functioned as a ship (i.e., to displace water and float things), then the vessel on the shore is the closest related continuation of what historically had been called "Theseus's ship." But if, instead, I give great weight to different factors such as the importance of the battles the vessel was involved in (and Theseus's big battles were all within the last three years), then the vessel that now floats on the Mediterranean—not the ship on the shore made up of Theseus's original planks—is Theseus's ship, and the thing on the shore is old spare parts.
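The dependence of "sameness" on the weights given to different factors can be made concrete with a toy calculation. The following is only an illustrative sketch: the two factors, the 0-to-1 scores and the weights are all hypothetical values invented for this example, and a weighted Euclidean distance is just one way to read the description of the closest related continuer given above.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two things scored on the same factors."""
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in weights.items()))


def closest_continuer(reference, candidates, weights):
    """Name of the candidate lying nearest to the reference under the given weights."""
    return min(
        candidates,
        key=lambda name: weighted_distance(reference, candidates[name], weights),
    )


# Hypothetical scores: share of original materials, share of Theseus's battles.
theseus_ship = {"materials": 1.0, "battles": 1.0}
candidates = {
    "ship at sea": {"materials": 0.0, "battles": 1.0},
    "ship on shore": {"materials": 1.0, "battles": 0.0},
}

materials_matter = {"materials": 0.9, "battles": 0.1}
battles_matter = {"materials": 0.1, "battles": 0.9}

by_materials = closest_continuer(theseus_ship, candidates, materials_matter)
by_battles = closest_continuer(theseus_ship, candidates, battles_matter)
print(by_materials)  # ship on shore
print(by_battles)    # ship at sea
```

The same two candidates trade places as the weights change, which is the sense in which the question of identity becomes an empirical matter of factors and their weights.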
So here was Nozick saying that the fundamental riddle of how two things could be the same ultimately resolves itself into an empirical question involving observable factors and weighing them in various combinations to determine the closest related continuer. The question of "sameness" is not an a priori question at all; apart from being a logical impossibility, it is an empirical question. For us, no two "studies" are the same. All studies differ and the only interesting questions to ask about them concern how they vary across the factors we conceive of as important. This notion is not fully developed here and I will return to it later.
The "Flat Earth" Criticism
I may not be the best person to critique meta-analysis, for obvious reasons. However, I will cop to legitimate criticisms of the approach when I see them, and I haven't seen many. But one criticism rings true because I knew at the time that I was being forced into a position with which I wasn't comfortable. Permit me to return to the psychotherapy meta-analysis.
Eysenck was, as I have said, a nettlesome critic of the psychotherapy establishment in the 1960s and 1970s. His exaggerated and inflammatory statements about psychotherapy being worthless (no better than a placebo) were not believed by psychotherapists or researchers, but they were not being effectively rebutted either. Instead of taking him head-on, as my colleagues and I attempted to do, researchers, like Gordon Paul, for example, attempted to argue that the question whether psychotherapy was effective was fundamentally meaningless. Rather, asserted Paul while many others assented, the only legitimate research question was "What type of therapy, with what type of client, produces what kind of effect?" I confess that I found this distracting dodge as frustrating as I found Eysenck's blanket condemnation. Here was a critic—Eysenck—saying that all psychotherapists are either frauds or gullible, self-deluded incompetents, and the establishment's response is to assert that he is not making a meaningful claim. Well, he was making a meaningful claim; and I already knew enough from the meta-analysis of the outcome studies to know that Paul's question was unanswerable due to insufficient data, and that researchers were showing almost no interest in collecting the kind of data that Paul and others argued were the only meaningful data.
It fell to me, I thought, to argue that the general question "Is psychotherapy effective?" is meaningful and that psychotherapy is effective. Such generalizations—across types of therapy, types of client and types of outcome—are meaningful to many people—policy makers, average citizens—if not to psychotherapy researchers or psychotherapists themselves. It was not that I necessarily believed that different therapies did not have different effects for different kinds of people; rather, I felt certain that the available evidence, tons of it, did not establish with any degree of confidence what these differential effects were. It was safe to say that in general psychotherapy works on many things for most people, but it was impossible to argue that this therapy was better than that therapy for this kind of problem. (I might add that twenty years after the publication of The Benefits of Psychotherapy, I still have not seen compelling answers to Paul's questions, nor is there evidence of researchers having any interest in answering them.)
The circumstances of the debate, then, put me in the position of arguing, circa 1980, that there are very few differences among various ways of treating human beings and that, at least, there is scarcely any convincing experimental evidence to back up claims of differential effects. And that policy makers and others hardly need to waste their time asking such questions or looking for the answers. Psychotherapy works; all types of therapy work about equally well; support any of them with your tax dollars or your insurance policies. Class size reductions work—very gradually at first (from 30 to 25 say) but more impressively later (from 15 to 10); they work equally for all grades, all subjects, all types of student. Reduce class sizes, and it doesn't matter where or for whom.
Well, one of my most respected colleagues called me to task for this way of thinking and using social science research. In a beautiful and important paper entitled "Prudent Aspirations for the Social Sciences," Lee Cronbach chastised his profession for promising too much and chastised me for expecting too little. He lumped me with a small group of like-minded souls into what he named the "Flat Earth Society," i.e., a group of people who believe that the terrain that social scientists explore is featureless, flat, with no interesting interactions or topography. All therapies work equally well; all tests predict success to about the same degree; etc.:
"...some of our colleagues are beginning to sound like a kind of Flat Earth Society. They tell us that the world is essentially simple: most social phenomena are adequately described by linear relations; one-parameter scaling can discover coherent variables independent of culture and population; and inconsistences among studies of the same kind will vanish if we but amalgamate a sufficient number of studies.... The Flat Earth folk seek to bury any complex hypothesis with an empirical bulldozer." (Cronbach, 1982, p. 70.)
Cronbach's criticism stung because it was on target. In attempting to refute Eysenck's outlandishness without endorsing the psychotherapy establishment's obfuscation, I had taken a position of condescending simplicity. A meta-analysis will give you the BIG FACT, I said; don't ask for more sophisticated answers; they aren't there. My own work tended to take this form, and much of what has ensued in the past 25 years has regrettably followed suit. Effect sizes—if it is experiments that are at issue—are calculated, classified in a few ways, perhaps, and all their variability is then averaged across. Little effort is invested in trying to plot the complex, variegated landscape that most likely underlies our crude averages.
Consider an example that may help illuminate these matters. Perhaps the most controversial conclusion from the psychotherapy meta-analysis that my colleagues and I published in 1980 was that there was no evidence favoring behavioral psychotherapies over non-behavioral psychotherapies. This finding was vilified by the behavioral therapy camp and praised by the Rogerians and Freudians. Some years later, prodded by Cronbach's criticism, I returned to the database and dug a little deeper. What I found appears in Figure 2. Nine experiments were extant in 1979—and I would be surprised if there are many more now—in which behavioral and non-behavioral psychotherapies were compared in the same experiment between randomized groups. When the effects of treatment in these nine experiments are plotted as a function of follow-up time, the two curves in Figure 2 result. The findings are quite extraordinary and suggestive. Behavioral therapies produce large short-term effects which decay in strength over the first year of follow-up; non-behavioral therapies produce initially smaller effects which increase over time. The two curves appear to be converging on the same long-term effect. I leave it to the reader to imagine why. One answer, I suspect, is not arcane and is quite plausible.
In the twenty-five years between the first appearance of the word "meta-analysis" in print and today, there have been several attempts to modify the approach, or advance alternatives to it, or extend the method to reach auxiliary issues. If I may be so cruel, few of these efforts have added much. One of the hardest things to abide in following the developments in meta-analysis methods in the past couple of decades was the frequent observation that what I had contributed to the problem of research synthesis was the idea of dividing mean differences by standard deviations. "Effect sizes," as they are called, had been around for decades before I opened my first statistics text. Having to read that "Glass has proposed integrating studies by dividing mean differences by standard deviations and averaging them" was a bitter pill to swallow. Some of the earliest work that I and my colleagues did involved using a variety of outcome measures to be analyzed and synthesized: correlations, regression coefficients, proportions, odds ratios. Well, so be it; better to be mentioned in any favorable light than not to be remembered at all.
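For readers who have not met the calculation, the "dividing mean differences by standard deviations and averaging them" recipe quoted above can be sketched in a few lines. This is a minimal illustration with made-up scores; using the control-group standard deviation as the scale is an assumption of the sketch, and a pooled standard deviation is another common choice.

```python
from statistics import mean, stdev


def effect_size(treated, control):
    """Standardized mean difference: (treated mean - control mean) / control SD."""
    return (mean(treated) - mean(control)) / stdev(control)


# Hypothetical outcome scores from one small experiment.
treated = [14.0, 15.0, 17.0, 18.0, 16.0]
control = [12.0, 13.0, 15.0, 16.0, 14.0]
d = effect_size(treated, control)

# Averaging hypothetical effect sizes from several studies.
study_effects = [d, 0.40, 0.85]
average_effect = mean(study_effects)

print(round(d, 2), round(average_effect, 2))
```

Because each effect is expressed in standard deviation units, studies that used different outcome measures can be placed on one scale before they are averaged or otherwise analyzed.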
After all, this was not as hard to take as newly minted confections such as "best evidence research synthesis," a come-lately contribution that added nothing whatsoever to what I and many others had been saying repeatedly on the question of whether meta-analyses should use all studies or only "good" studies. I remain staunchly committed to the idea that meta-analyses must deal with all studies, good, bad and indifferent, and that their results are only properly understood in the context of each other, not after having been censored by some a priori set of prejudices. An effect size of 1.50 for 20 studies employing randomized groups has a whole different meaning when 50 studies using matching show an average effect of 1.40 than if 50 matched groups studies show an effect of -.50, for example.
The appropriate role for inferential statistics in meta-analysis is not merely unclear, it has been seen quite differently by different methodologists in the 25 years since meta-analysis appeared. In 1981, in the first extended discussion of the topic (Glass, McGaw and Smith, 1981), I raised doubts about the applicability of inferential statistics in meta-analysis. Inference at the level of persons within studies seemed quite unnecessary, since even a modest size synthesis will involve a few hundred persons (nested within studies) and lead to nearly automatic rejection of null hypotheses. Moreover, the chances are remote that the persons or subjects within studies were drawn from defined populations with anything even remotely resembling probabilistic techniques. Hence, probabilistic calculations advanced as if subjects had been randomly selected would be dubious. At the level of "studies," the question of the appropriateness of inferential statistics can be posed again, and the answer again seems to be negative. There are two instances in which common inferential methods are clearly appropriate, not just in meta-analysis but in any research: 1) when a well defined population has been randomly sampled, and 2) when subjects have been randomly assigned to conditions in a controlled experiment. In the latter case, Fisher showed how the permutation test can be used to make inferences to the universe of all possible permutations. But this case is of little interest to meta-analysts who never assign units to treatments. Moreover, the typical meta-analysis virtually never meets the condition of probabilistic sampling of a population (though in one instance (Smith, Glass & Miller, 1980), the available population of psychoactive drug treatment experiments was so large that a random sample of experiments was in fact drawn for the meta-analysis). Inferential statistics has little role to play in meta-analysis.
It is common to acknowledge, in meta-analysis and elsewhere, that many data sets fail to meet probabilistic sampling conditions, and then to argue that one ought to treat the data in hand "as if" it were a random sample of some hypothetical population. One must be wary here of the slide from "hypothesis about a population" into "a hypothetical population." They are quite different things, the former being standard and unobjectionable, the latter being a figment with which we hardly know how to deal. Under this stipulation that one is making inferences not to some defined or known population but a hypothetical one, inferential techniques are applied and the results inspected. The direction taken mirrors some of the earliest published opinion on this problem in the context of research synthesis, expressed, for example, by Mosteller and his colleagues in 1977: "One might expect that if our MEDLARS approach were perfect and produced all the papers we would have a census rather than a sample of the papers. To adopt this model would be to misunderstand our purpose. We think of a process producing these research studies through time, and we think of our sample—even if it were a census—as a sample in time from the process. Thus, our inference would still be to the general process, even if we did have all appropriate papers from a time period." (Gilbert, McPeek and Mosteller, 1977, p. 127; quoted in Cook et al., 1992, p. 291) This position is repeated in slightly different language by Larry Hedges in Chapter 3 "Statistical Considerations" of the Handbook of Research Synthesis (1994): "The universe is the hypothetical collection of studies that could be conducted in principle and about which we wish to generalize. The study sample is the ensemble of studies that are used in the review and that provide the effect size data used in the research synthesis." (p. 30)
These notions appear to be circular. If the sample is fixed and the population is allowed to be hypothetical, then surely the data analyst will imagine a population that resembles the sample of data. If I show you a handful of red and green M&Ms, you will naturally assume that I have just drawn my hand out of a bowl of mostly red and green M&Ms, not red and green and brown and yellow ones. Hence, all of these "hypothetical populations" will be merely reflections of the samples in hand and there will be no need for inferential statistics. Or put another way, if the population of inference is not defined by considerations separate from the characterization of the sample, then the population is merely a large version of the sample. With what confidence is one able to generalize the character of this sample to a population that looks like a big version of the sample? Well, with a great deal of confidence, obviously. But then, the population is nothing but the sample writ large and we really know nothing more than what the sample tells us in spite of the fact that we have attached misleadingly precise probability numbers to the result.
Hedges and Olkin (1985) have developed inferential techniques that ignore the pro forma testing (because of large N) of null hypotheses and focus on the estimation of regression functions that estimate effects at different levels of study. They worry about both sources of statistical instability: that arising from persons within studies and that which arises from variation between studies. The techniques they present are based on traditional assumptions of random sampling and independence. It is, of course, unclear to me precisely how the validity of their methods is compromised by failure to achieve probabilistic sampling of persons and studies.
The irony of traditional hypothesis testing approaches applied to meta-analysis is that whereas consideration of sampling error at the level of persons always leads to a pro forma rejection of "null hypotheses" (of zero correlation or zero average effect size), consideration of sampling error at the level of study characteristics (the study, not the person as the unit of analysis) leads to too few rejections (too many Type II errors, one might say). Hedges's homogeneity test of the hypothesis that all studies in a group estimate the same population parameter is frequently seen in published meta-analyses these days. Once a hypothesis of homogeneity is accepted by Hedges's test, one is advised to treat all studies within the ensemble as the same. Experienced data analysts know, however, that there is typically a good deal of meaningful covariation between study characteristics and study findings even within ensembles where Hedges's test cannot reject the homogeneity hypothesis. The situation is parallel to the experience of psychometricians discovering that they could easily interpret several more common factors than inferential solutions (maximum-likelihood; LISREL) could confirm. The best data exploration and discovery are more complex and convincing than the most exact inferential test. In short, classical statistics seems not able to reproduce the complex cognitive processes that are commonly applied with success by data analysts.
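The point about homogeneity tests can be made concrete with a small numerical sketch. Everything in it is assumed for illustration: the six effect sizes, their common sampling variance, and the moderator ("hours of treatment") are invented and come from no real meta-analysis. The statistic is the usual inverse-variance-weighted Q, compared against the tabled chi-square critical value for 5 degrees of freedom at the .05 level.

```python
import math

def homogeneity_q(effects, variances):
    """Return (Q, weighted mean effect) using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    return q, mean

def pearson_r(xs, ys):
    """Plain (unweighted) Pearson correlation between two lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ensemble of six studies (invented numbers).
effects = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]   # standardized effect sizes
variances = [0.04] * 6                            # sampling variance of each effect
hours = [1, 2, 3, 4, 5, 6]                        # hypothetical study characteristic

# Tabled chi-square critical value, df = k - 1 = 5, alpha = .05.
CHI2_CRIT_DF5 = 11.070

q, mean_d = homogeneity_q(effects, variances)
r = pearson_r(hours, effects)

print(f"Q = {q:.3f} on 5 df (critical value {CHI2_CRIT_DF5})")
print(f"homogeneity rejected: {q > CHI2_CRIT_DF5}")
print(f"correlation of effect size with hours: {r:.2f}")
```

In this contrived ensemble the Q test does not reject homogeneity, yet the effect sizes rise steadily with the study characteristic. The test and the exploratory look are answering different questions.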
Donald Rubin (1990) addressed some of these issues squarely and articulated a position that I find very appealing: "...consider the idea that sampling and representativeness of the studies in a meta-analysis are important. I will claim that this is nonsense—we don't have to worry about representing a population but rather about other far more important things." (p. 155) These more important things to Rubin are the estimation of treatment effects under a set of standard or ideal study conditions. This process, as he outlined it, involves the fitting of response surfaces (a form of quantitative model building) between study effects (Y) and study conditions (X, W, Z etc.). I would only add to Rubin's statement that we are interested in not merely the response of the system under ideal study conditions but under many conditions having nothing to do with an ideally designed study, e.g., person characteristics, follow-up times and the like.
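A response surface in its simplest form is a regression of study effects on a study condition. The sketch below is only that simplest case, not Rubin's own procedure: a weighted straight line with one predictor, fitted by inverse-variance weights. The data (effect sizes, sampling variances, and attrition rates for five studies) are invented, and the function name `weighted_line` is my own.

```python
def weighted_line(x, y, variances):
    """Fit y = intercept + slope * x by weighted least squares (weights = 1/variance)."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Hypothetical studies: X = proportion of subjects lost to attrition,
# Y = observed effect size, with an assumed sampling variance for each Y.
attrition = [0.05, 0.10, 0.20, 0.30, 0.40]
effect = [0.58, 0.55, 0.52, 0.44, 0.41]
variance = [0.02, 0.03, 0.02, 0.04, 0.05]

a, b = weighted_line(attrition, effect, variance)

# Estimated effect under the "ideal" condition of no attrition (X = 0),
# and under a less-than-ideal condition that may matter just as much.
print(f"estimated effect at  0% attrition: {a:.2f}")
print(f"estimated effect at 25% attrition: {a + b * 0.25:.2f}")
```

Reading the fitted line at X = 0 gives the effect under the ideal condition; reading it elsewhere gives the response under the conditions that practice actually presents. Further predictors such as person characteristics or follow-up time would turn the line into a surface.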
By far most meta-analyses are undertaken in pursuit not of scientific theory but technological evaluation. The evaluation question is never whether some hypothesis or model is accepted or rejected but rather how "outputs" or "benefits" or "effect sizes" vary from one set of circumstances to another; and the meta-analysis rarely works on a collection of data that can sensibly be described as a probability sample from anything.
If our efforts to research and improve education are to prosper, meta-analysis will have to be replaced by more useful and more accurate ways of synthesizing research findings. To catch a glimpse of what this future for research integration might look like, we need to look back at the deficiencies in our research customs that produced meta-analysis in the first place.
First, the high cost in the past of publishing research results led to cryptic reporting styles that discarded most of the useful information that research revealed. To encapsulate complex relationships in statements like "significant at the .05 level" was a travesty—a travesty that continues today out of bad habit and bureaucratic inertia.
Second, we need to stop thinking of ourselves as scientists testing grand theories, and face the fact that we are technicians collecting and collating information, often in quantitative forms. Paul Meehl (1967; 1978) dispelled once and for all the misconception that we in what he called the "soft social sciences" are testing theories in any way even remotely resembling how theory focuses and advances research in the hard sciences. Indeed, the mistaken notion that we are theory driven has, in Meehl's opinion, led us into a worthless pro forma ritual of testing and rejecting statistical hypotheses that are a priori known to be 99% false before they are tested.
Third, the conception of our work that held that "studies" are the basic, fundamental unit of a research program may be the single most counterproductive influence of all. This idea that we design a "study," and that a study culminates in the test of a hypothesis and that a hypothesis comes from a theory—this idea has done more to retard progress in educational research than any other single notion. Ask an educational researcher what he or she is up to, and they will reply that they are "doing a study," or "designing a study," or "writing up a study" for publication. Ask a physicist what's up and you'll never hear the word "study." (In fact, if one goes to http://xxx.lanl.gov where physicists archive their work, one will seldom see the word "study." Rather, physicists—the data gathering experimental ones—report data, all of it, that they have collected under conditions that they carefully described. They contrive interesting conditions that can be precisely described and then they report the resulting observations.)
Meta-analysis was created out of the need to extract useful information from the cryptic records of inferential data analyses in the abbreviated reports of research in journals and other printed sources. "What does this t-test really say about the efficacy of ritalin in comparison to caffeine?" Meta-analysis needs to be replaced by archives of raw data that permit the construction of complex data landscapes that depict the relationships among independent, dependent and mediating variables. We wish to be able to answer the question, "What is the response of males ages 5-8 to ritalin at these dosage levels on attention, acting out and academic achievement after one, three, six and twelve months of treatment?"
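The difference between extracting a finding from a reported test statistic and computing it from archived raw data can be sketched briefly. The scores below are invented. The conversion d = t * sqrt(1/n1 + 1/n2) is the standard one for an independent-groups t with pooled variance, and the "journal report" is simulated here simply by rounding t to one decimal place.

```python
import math

def summarize(group):
    """Return (n, mean, unbiased variance) for a list of scores."""
    n = len(group)
    mean = sum(group) / n
    var = sum((x - mean) ** 2 for x in group) / (n - 1)
    return n, mean, var

def pooled_sd(treated, control):
    n1, _, v1 = summarize(treated)
    n2, _, v2 = summarize(control)
    return math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))

def d_from_raw(treated, control):
    """Standardized mean difference computed directly from raw scores."""
    _, m1, _ = summarize(treated)
    _, m2, _ = summarize(control)
    return (m1 - m2) / pooled_sd(treated, control)

def t_from_raw(treated, control):
    """Two-sample t statistic (pooled variance), as a journal might report it."""
    n1, m1, _ = summarize(treated)
    n2, m2, _ = summarize(control)
    return (m1 - m2) / (pooled_sd(treated, control) * math.sqrt(1 / n1 + 1 / n2))

def d_from_t(t, n1, n2):
    """Effect size recovered from a reported t and the two group sizes."""
    return t * math.sqrt(1 / n1 + 1 / n2)

# Hypothetical attention scores for two small groups (invented numbers).
treated = [12, 15, 14, 10, 13, 16]
control = [9, 11, 10, 8, 12, 10]

d_raw = d_from_raw(treated, control)
reported_t = round(t_from_raw(treated, control), 1)   # what survives in print
d_reported = d_from_t(reported_t, len(treated), len(control))

print(f"d from raw data:   {d_raw:.3f}")
print(f"d from reported t: {d_reported:.3f}")
```

With the full-precision t the two routes agree exactly; the discrepancy comes entirely from what the printed report discards. A report that says only "significant at the .05 level" discards far more, and the raw scores also permit looking at distribution shapes, which no test statistic preserves.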
We can move toward this vision of useful synthesized archives of research now if we simply re-orient our ideas about what we are doing when we do research. We are not testing grand theories; rather, we are charting dosage-response curves for technological interventions under a variety of circumstances. We are not informing colleagues that our straw-person null hypothesis has been rejected at the .01 level; rather, we are sharing data collected and reported according to some commonly accepted protocols. We aren't publishing "studies"; rather, we are contributing to data archives.
Five years ago, this vision of how research should be reported and shared seemed hopelessly quixotic. Now it seems easily attainable. The difference is the I-word: the Internet. In 1993, spurred by the ludicrously high costs and glacial turn-around times of traditional scholarly journals, I created an internet-based peer-reviewed journal on education policy analysis (http://epaa.asu.edu). This journal, named Education Policy Analysis Archives, is now in its seventh year of publication, has published 150 articles, is accessed daily without cost by nearly 1,000 persons (the other three paper journals in this field have average total subscription bases of fewer than 1,000 persons), and has an average "lag" from submission to publication of about three weeks. Moreover, we have just this year started accepting articles in both English and Spanish. And all of this has been accomplished without funds other than the time I put into it as part of my normal job: no secretaries, no graduate assistants, nothing but a day or two a week of my time.
Two years ago, we adopted the policy that anyone publishing a quantitative study in the journal would have to agree to archive all the raw data at the journal website so that the data could be downloaded by any reader. Our authors have done so with enthusiasm. I think that you can see how this capability puts an entirely new face on the problem of how we integrate research findings: no more inaccurate conversions of inferential test statistics into something worth knowing like an effect size or a correlation coefficient or an odds ratio; no more speculating about distribution shapes; no more frustration at not knowing what violence has been committed when linear coefficients mask curvilinear relationships. Now we simply download each other's data, and the synthesis prize goes to the person who best assembles the pieces of the jigsaw puzzle into a coherent picture of how the variables relate to each other.
Cook, T.D. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.
Cooper, H.M. (1989). Integrating research: a guide for literature reviews. 2nd ed. Newbury Park, CA: SAGE Publications.
Cooper, H.M. and Hedges, L. V. (Eds.) (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cronbach, L.J. (1982). Prudent Aspirations for Social Inquiry. Chapter 5 (Pp. 61-81) in Kruskal, W.H. (Ed.), The social sciences: Their nature and uses. Chicago: The University of Chicago Press.
Eysenck, H.J. (1965). The effects of psychotherapy. International Journal of Psychiatry, 1, 97-178.
Glass, G. V (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.
Glass, G.V (1978). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-379.
Glass, G. V, McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: SAGE Publications.
Glass, G.V et al. (1982). School class size: Research and policy. Beverly Hills, CA: SAGE Publications.
Glass, G.V and Robbins, M.P. (1967). A critique of experiments on the role of neurological organization in reading performance. Reading Research Quarterly, 3, 5-51.
Hedges, L. V., Laine, R. D., & Greenwald, R. (1994). Does Money Matter? A Meta-Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes. Educational Researcher, 23(3), 5-14.
Hedges, L.V. and Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Hunt, M. (1997). How science takes stock: The story of meta-analysis. New York: Russell Sage Foundation.
Hunter, J.E. & Schmidt, F.L. (1990). Methods of meta-analysis: correcting error and bias in research findings. Newbury Park, CA: SAGE Publications.
Hunter, J.E., Schmidt, F.L. & Jackson, G.B. (1982). Meta-analysis: cumulating research findings across studies. Beverly Hills, CA: SAGE Publications.
Light, R. J., Singer, J. D., & Willett, J. B. (1990). By design: Planning research on higher education. Cambridge, MA: Harvard University Press.
Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-15.
Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-34.
Rosenthal, R. (1976). Experimenter effects in behavioral research. New York: John Wiley.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Rev. ed. Newbury Park, CA: SAGE Publications.
Rubin, D. (1990). A new perspective. Chp. 14 (pp. 155-166) in Wachter, K. W. & Straf, M. L. (Eds.). The future of meta-analysis. New York: Russell Sage Foundation.
Smith, M.L. and Glass, G.V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-60.
Smith, M.L., Glass, G.V and Miller, T.I. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.
Underwood, B.J. (1957). Interference and forgetting. Psychological Review, 64(1), 49-60.
Wachter, K.W. and Straf, M.L., (Editors). (1990). The future of meta-analysis. New York: Russell Sage Foundation.
Wolf, F.M. (1986). Meta-analysis: quantitative methods for research synthesis. Beverly Hills, CA: SAGE Publications.