A Personal History
Gene V Glass
Arizona State University
It has been nearly 25 years since meta-analysis, under that name and in its current guise made its first appearance. I wish to avoid the weary references to the new century or millenium—depending on how apocalyptic you're feeling (besides, it's 5759 on my calendar anyway)—and simply point out that meta-analysis is at the age when most things graduate from college, so it's not too soon to ask what accounting can be made of it. I have refrained from publishing anything on the topic of the methods of meta-analysis since about 1980 out of a reluctance to lay some heavy hand on other people's enthusiasms and a wish to hide my cynicism from public view. Others have eagerly advanced its development and I'll get to their contributions shortly (Cooper & Hedges, 1994; Hedges and Olkin, 1985; Hunter, Schmidt and Jackson, 1982).
Autobiography may be the truest, most honest narrative, even if it risks self-aggrandizement, or worse, self-deception. Forgive me if I risk the latter for the sake of the former. For some reason it is increasingly difficult these days to speak in any other way.
In the span of this rather conventional paper, I wish to review the brief history of the form of quantitative research synthesis that is now generally known as "meta-analysis" (though I can't possibly recount this history as well as has Morton Hunt (1997) in his new book How Science Takes Stock: The story of Meta-Analysis), tell where it came from, why it happened when it did, what was wrong with it and what remains to be done to make the findings of research in the social and behavioral sciences more understandable and useful.
In 25 years, meta-analysis has grown from an unheard of preoccupation of a very small group of statisticians working on problems of research integration in education and psychotherapy to a minor academic industry, as well as a commercial endeavor (see http://epidemiology.com/ and http://members.tripod.com/~Consulting_Unlimited/, for example). A keyword web search—the contemporary measure of visibility and impact—(Excite, January 28, 2000) on the word "meta-analysis" brings 2,200 "hits" of varying degrees of relevance, of course. About 25% of the articles in the Psychological Bulletin in the past several years have the term "meta-analysis" in the title. Its popularity in the social sciences and education is nothing compared to its influence in medicine, where literally hundreds of meta-analyses have been published in the past 20 years. (In fact, my internist quotes findings of what he identifies as published meta-analyses during my physical exams.) An ERIC search shows well over 1,500 articles on meta-analyses written since 1975.
Surely it is true that as far as meta-analysis is concerned, necessity was the mother of invention, and if it hadn't been invented—so to speak—in the early 1970s it would have been invented soon thereafter since the volume of research in many fields was growing at such a rate that traditional narrative approaches to summarizing and integrating research were beginning to break down. But still, the combination of circumstances that brought about meta-analysis in about 1975 may itself be interesting and revealing. There were three circumstances that influenced me.
The first was personal. I left the University of Wisconsin in 1965 with a brand new PhD in psychometrics and statistics and a major league neurosis—years in the making—that was increasingly making my life miserable. Luckily, I found my way into psychotherapy that year while on the faculty of the University of Illinois and never left it until eight years later while teaching at the University of Colorado. I was so impressed with the power of psychotherapy as a means of changing my life and making it better that by 1970 I was studying clinical psychology (with the help of a good friend and colleague Vic Raimy at Boulder) and looking for opportunities to gain experience doing therapy.
In spite of my personal enthusiasm for psychotherapy, the weight of academic opinion at that time derived from Hans Eysenck's frequent and tendentious reviews of the psychotherapy outcome research that proclaimed psychotherapy as worthless—a mere placebo, if that. I found this conclusion personally threatening—it called into question not only the preoccupation of about a decade of my life but my scholarly judgment (and the wisdom of having dropped a fair chunk of change) as well. I read Eysenck's literature reviews and was impressed primarily with their arbitrariness, idiosyncrasy and high-handed dismissiveness. I wanted to take on Eysenck and show that he was wrong: psychotherapy does change lives and make them better.
The second circumstance that prompted meta-analysis to come out when it did had to do with an obligation to give a speech. In 1974, I was elected President of the American Educational Research Association, in a peculiar miscarriage of the democratic process. This position is largely an honorific title that involves little more than chairing a few Association Council meetings and delivering a "presidential address" at the Annual Meeting. It's the "presidential address" that is the problem. No one I know who has served as AERA President really feels that they deserved the honor; the number of more worthy scholars passed over not only exceeds the number of recipients of the honor by several times, but as a group they probably outshine the few who were honored. Consequently, the need to prove one's worthiness to oneself and one's colleagues is nearly overwhelming, and the most public occasion on which to do it is the Presidential address, where one is assured of an audience of 1,500 or so of the world's top educational researchers. Not a few of my predecessors and contemporaries have cracked under this pressure and succumbed to the temptation to spin out grandiose fantasies about how educational research can become infallible or omnipotent, or about how government at national and world levels must be rebuilt to conform to the dictates of educational researchers. And so I approached the middle of the 1970s knowing that by April 1976 I was expected to release some bombast on the world that proved my worthiness for the AERA Presidency, and knowing that most such speeches were embarrassments spun out of feelings of intimidation and unworthiness. (A man named Richard Krech, I believe, won my undying respect when I was still in graduate school; having been distinguished by the American Psychological Association in the 1960s with one of its highest research awards, Krech, a professor at Berkeley, informed the Association that he was honored, but that he had nothing particularly new to report to the organization at the obligatory annual convention address, but if in the future he did have anything worth saying, they would hear it first.)
The third set of circumstances that joined my wish to annihilate Eysenck and prove that psychotherapy really works and my need to make a big splash with my Presidential Address was that my training under the likes of Julian Stanley, Chester Harris, Henry Kaiser and George E. P. Box at Wisconsin in statistics and experimental design had left me with a set of doubts and questions about how we were advancing the empirical agenda in educational research. In particular, I had learned to be very skeptical of statistical significance testing; I had learned that all research was imperfect in one respect or another (or, in other words, there are no "perfectly valid" studies nor any line that demarcates "valid" from "invalid" studies); and third, I was beginning to question a taken-for-granted assumption of our work that we progress toward truth by doing what everyone commonly refers to as "studies." (I know that these are complex issues that need to be thoroughly examined to be accurately communicated, and I shall try to return to them.) I recall two publications from graduate school days that impressed me considerably. One was a curve relating serial position of a list of items to be memorized to probability of correct recall that Benton Underwood (1957) had synthesized from a dozen or more published memory experiments. The other was a Psychological Bulletin article by Sandy Astin on the effects of glutamic acid on mental performance (whose results presaged a meta-analysis of the Feingold diet research 30 years later in that poorly controlled experiments showed benefits and well controlled experiments did not).
Permit me to say just a word or two about each of these studies because they very much influenced my thinking about how we should "review" research. Underwood had combined the findings of 16 experiments on serial learning to demonstrate a consistent geometrically decreasing curve describing the declining probability of correct recall as a function of number of previously memorized items, thus giving strong weight to an interference explanation of recall errors. What was interesting about Underwood's curve was that it was an amalgamation of studies that had different lengths of lists and different items to be recalled (nonsense syllables, baseball teams, colors and the like).
Astin's Psychological Bulletin review had attracted my attention in another respect. Glutamic acid—it will now scarcely be remembered—was a discovery of the 1950s that putatively increased the ability of tissue to absorb oxygen. Reasoning with the primitive constructs of the time, researchers hypothesized that more oxygen to the brain would produce more intelligent behavior. (It is not known what amount of oxygen was reaching the brains of the scientists proposing this hypothesis.) A series of experiments in the 1950s and 1960s tested glutamic acid against "control groups" and by 1961, Astin was able to array these findings in a crosstabulation that showed that the chances of finding a significant effect for glutamic acid were related (according to a chi-square test) to the presence or absence of various controls in the experiment; placebos and blinding of assessors, for example, were associated with no significant effect of the acid. As irrelevant as the chi-square test now seems, at the time I saw it done, it was revelatory to see "studies" being treated as data points in a statistical analysis. (In 1967, I attempted a similar approach while reviewing the experimental evidence on the Doman-Delacato pattern therapy. Glass and Robbins, 1967)
At about the same time I was reading Underwood and Astin, I certainly must have read Ben Bloom's Stability of Human Characteristics (1963), but its aggregated graphs of correlation coefficients made no impression on me, because it was many years after work to be described below that I noticed a similarlity between his approach and meta-analysis. Perhaps the connections were not made because Bloom dealt with variables such as age, weight, height, IQ and the like where the problems of dissimilarity of variables did not force one to worry about the kinds of problem that lie at the heart of meta-analysis.
If precedence is of any concern, Bob Rosenthal deserves as much credit as anyone for furthering what we now conveniently call "meta-analysis." In 1976, he published Experimenter Effects in Behvaioral Research, which contained calculations of many "effect sizes" (i.e., standardized mean differences) that were then compared across domains or conditions. If Bob had just gone a little further in quantifying study characteristics and subjecting the whole business to regression analyses and what-not, and then thinking up a snappy name, it would be his name that came up every time the subject is research integration. But Bob had an even more positive influence on the development of meta-analysis than one would infer from his numerous methodological writings on the subject. When I was making my initial forays onto the battlefield of psychotherapy outcome research—about which more soon—Bob wrote me a very nice and encouraging letter in which he indicated that the approach we were taking made perfect sense. Of course, it ought to have made sense to him, considering that it was not that different from what he had done in Experimenter Effects. He probably doesn't realize how important that validation from a stranger was. (And while on the topic of snappy names, although people have suggested or promoted several polysyllabic alternatives—quantitative synthesis, statistical research integration—the name meta- analysis, suggested by Michael Scriven's meta-evaluation (meaning the evaluation of evaluations), appears to have caught on. To press on further into it, the "meta" comes from the Greek preposition meaning "behind" or "in back of." Its application as in "metaphysics" derives from the fact that in the publication of Aristotle's writings during the Middle Ages, the section dealing with the transcendental was bound immediately behind the section dealing with physics; lacking any title provided by its author, this final section became known as Aristotle's "metaphysics." So, in fact, metaphysics is not some grander form of physics, some all encompassing, overarching general theory of everthing; it is merely what Aristotle put after the stuff he wrote on physics. The point of this aside is to attempt to leach out of the term "meta-analysis" some of the grandiosity that others see in it. It is not the grand theory of research; it is simply a way of speaking of the statistical analysis of statistical analyses.)
So positioned in these circumstances, in the summer of 1974, I set about to do battle with Dr. Eysenck and prove that psychotherapy—my psychotherapy—was an effective treatment. (Incidentally, though it may be of only the merest passing interest, my preferences for psychotherapy are Freudian, a predilection that causes Ron Nelson and other of my ASU colleagues great distress, I'm sure.) I joined the battle with Eysenck's 1965 review of the psychotherapy outcome literature. Eysenck began his famous reviews by eliminating from consideration all theses, dissertations, project reports or other contemptible items not published in peer-reviewed journals. This arbitrary exclusion of literally hundreds of evaluations of therapy outcomes was indefensible. It's one thing to believe that peer review guarantees truth; it is quite another to believe that all truth appears in peer reviewed journals. (The most important paper on the multiple comparisons problem in ANOVA was distributed as an unpublished ditto manuscript from the Princeton University Mathematics Department by John Tukey; it never was published in a peer reviewed journal.)
Next, Eysenck eliminated any experiment that did not include an untreated control group. This makes no sense whatever, since head-to-head comparisons of two different types of psychotherapy contribute a great deal to our knowledge of psychotherapy effects. If a horse runs 20 mph faster than a man and 35 mph faster than a pig, I can conclude with confidence that the man will outrun the pig by 15 mph. Having winnowed a huge literature down to 11 studies (!) by whim and prejudice, Eysenck proceeded to describe their findings soley in terms of whether or not statistical significance was attained at the .05 level. No matter that the results may have barely missed the .05 level or soared beyond it. All that Eysenck considered worth noting about an experiment was whether the differences reached significance at the .05 level. If it reached significance at only the .07 level, Eysenck classified it as showing "no effect for psychotherapy."
Finally, Eysenck did something truly staggering in its illogic. If a study showed significant differences favoring therapy over control on what he regarded as a "subjective" measure of outcome (e.g., the Rorschach or the Thematic Apperception Test), he discounted the findings entirely. So be it; he may be a tough case, but that's his right. But then, when encountering a study that showed differences on an "objective" outcome measure (e.g., GPA) bit no differences on a subjective measure (like the TAT), Eysenck discounted the entire study because the outcome differences were "inconsistent."
Looking back on it, I can almost credit Eysenck with the invention of meta-analysis by anti-thesis. By doing everything in the opposite way that he did, one would have been led straight to meta-analysis. Adopt an a posteriori attitude toward including studies in a synthesis, replace statistical significance by measures of strength of relationship or effect, and view the entire task of integration as a problem in data analysis where "studies" are quantified and the resulting data-base subjected to statistical analysis, and meta-analysis assumes its first formulation. (Thank you, Professor Eysenck.)
Working with my colleague Mary Lee Smith, I set about to collect all the psychotherapy outcome studies that could be found and subjected them to this new form of analysis. By May of 1975, the results were ready to try out on a friendly group of colleagues. The May 12th Group had been meeting yearly since about 1968 to talk about problems in the area of program evaluation. The 1975 meeting was held in Tampa at Dick Jaeger's place. I worked up a brief handout and nervously gave my friends an account of the preliminary results of the psychotherapy meta-analysis. Lee Cronbach was there; so was Bob Stake, David Wiley, Les McLean and other trusted colleagues who could be relied on to demolish any foolishness they might see. To my immense relief they found the approach plausible or at least not obviously stupid. (I drew frequently in the future on that reassurance when others, whom I respected less, pronounced the entire business stupid.)
The first meta-analysis of the psychotherapy outcome research found that the typical therapy trial raised the treatment group to a level about two-thirds of a standard deviation on average above untreated controls; the average person receiving therapy finished the experiment in a position that exceeded the 75th percentile in the control group on whatever outcome measure happened to be taken. This finding summarized dozens of experiments encompassing a few thousand persons as subjects and must have been cold comfort to Professor Eysenck.
An expansion and reworking of the psychotherapy experiments resulted in the paper that was delivered as the much feared AERA Presidential address in April 1976. Its reception was gratifying. Two months later a long version was presented at a meeting of psychotherapy researchers in San Diego. Their reactions foreshadowed the eventual reception of the work among psychologists. Some said that the work was revolutionary and proved what they had known all along; others said it was wrongheaded and meaningless. The widest publication of the work came in 1977, in a now, may I say, famous article by Smith and Glass in the American Psychologist. Eysenck responded to the article by calling it "mega-silliness," a moderately clever play on meta- analysis that nonetheless swayed few.
Psychologists tended to fixate on the fact that the analysis gave no warrant to any claims that one type or style of psychotherapy was any more effective than any other: whether called "behavioral" or "Rogerian" or "rational" or "psychodynamic," all the therapies seemed to work and to work to about the same degree of effectiveness. Behavior therapists, who had claimed victory in the psychotherapy horserace because they were "scientific" and others weren't, found this conclusion unacceptable and took it as reason enough to declare meta-analysis invalid. Non- behavioral therapists—the Rogerians, Adlerians and Freudians, to name a few—hailed the meta-analysis as one of the great achievements of psychological research: a "classic," a "watershed." My cynicism about research and much of psychology dates from approximately this period.
The first appearances of meta-analysis in the 1970s were not met universally with encomiums and expressions of gratitude. There was no shortage of critics who found the whole idea wrong-headed, senseless, misbegotten, etc.
The Apples-and-Oranges Problem
Of course the most often repeated criticism of meta-analysis was that it was meaningless because it "mixed apples and oranges." I was not unprepared for this criticism; indeed, I had long before prepared my own defense: "Of course it mixes apples and oranges; in the study of fruit nothing else is sensible; comparing apples and oranges is the only endeavor worthy of true scientists; comparing apples to apples is trivial." But I misjudged the degree to which this criticism would take hold of people's opinions and shut down their minds. At times I even began to entertain my own doubts that it made sense to integrate any two studies unless they were studies of "the same thing." But, the same persons who were arguing that no two studies should be compared unless they were studies of the "same thing," were blithely comparing persons (i.e., experimental "subjects") within their studies all the time. This seemed inconsistent. Plus, I had a glimmer of the self-contradictory nature of the statement "No two things can be compared unless they are the same." If they are the same, there is no reason to compare them; indeed, if "they" are the same, then there are not two things, there is only one thing and comparison is not an issue. And yet I had a gnawing insecurity that the critics might be right. One study is an apple, and a second study is an orange; and comparing them is as stupid as comparing apples and oranges, except that sometimes I do hesitate while considering whether I'm hungry for an apple or an orange.
At about this time—late 1970s—I was browsing through a new book that I had bought out of a vague sense that it might be worth my time because it was written by a Harvard philosopher, carried a title like Philosophical Explanations and was written by an author—Robert Nozick—who had written one of the few pieces on the philosophy of the social sciences that ever impressed me as being worth rereading. To my amazement, Nozick spent the first one hundred pages of his book on the problem of "identity," i.e., what does it mean to say that two things are the same? Starting with the puzzle of how two things that are alike in every respect would not be one thing, Nozick unraveled the problem of identity and discovered its fundamental nature underlying a host of philosophical questions ranging from "How do we think?" to "How do I know that I am I?" Here, I thought at last, might be the answer to the "apples and oranges" question. And indeed, it was there.
Nozick considered the classic problem of Theseus's ship. Theseus, King of Thebes, and his men are plying the waters of the Mediterranean. Each day a sailor replaces a wooden plank in the ship. After nearly five years, every plank has been replaced. Are Theseus and his men still sailing in the same ship that was launched five years earlier on the Mediterranean? "Of course," most will answer. But suppose that as each original plank was removed, it was taken ashore and repositioned exactly as it had been on the waters, so that at the end of five years, there exists a ship on shore, every plank of which once stood in exactly the same relationship to every other in what five years earlier had been Theseus's ship. Is this ship on shore—which we could easily launch if we so chose—Theseus's ship? Or is the ship sailing the Mediterranean with all of its new planks the same ship that we originally regarded as Theseus's ship? The answer depends on what we understand the concept of "same" to mean?
Consider an even more troubling example that stems from the problem of the persistence of personal identity. How do I know that I am that person who I was yesterday, or last year, or twenty-five years ago? Why would an old high-school friend say that I am Gene Glass, even though hundreds, no thousands of things about me have changed since high school? Probably no cells are in common between this organism and the organism that responded to the name "Gene Glass" forty years ago; I can assure you that there are few attitudes and thoughts held in common between these two organisms—or is it one organism? Why then, would an old high-school friend, suitably prompted, say without hesitation, "Yes, this is Gene Glass, the same person I went to high school with." Nozick argued that the only sense in which personal identity survives across time is in the sense of what he called "the closest related continuer." I am still recognized as Gene Glass to those who knew me then because I am that thing most closely related to that person to whom they applied the name "Gene Glass" over forty years ago. Now notice that implied in this concept of the "closest related continuer" are notions of distance and relationship. Nozick was quite clear that these concepts had to be given concrete definition to understand how in particular instances people use the concept of identity. In fact, to Nozick's way of thinking, things are compared by means of weighted functions of constituent factors, and their "distance" from each other is "calculated" in many instances in a Euclidean way.
Consider Theseus's ship again. Is the ship sailing the seas the "same" ship that Theseus launched five years earlier? Or is the ship on the shore made of all the original planks from that first ship the "same" as Theseus's original ship? If I give great weight to the materials and the length of time those materials functioned as a ship (i.e., to displace water and float things), then the vessel on the shore is the closest related continuation of what historically had been called "Theseus's ship." But if, instead, I give great weight to different factors such as the importance of the battles the vessel was involved in (and Theseus's big battles were all within the last three years), then the vessel that now floats on the Mediterranean—not the ship on the shore made up of Theseus's original planks—is Theseus's ship, and the thing on the shore is old spare parts.
So here was Nozick saying that the fundamental riddle of how two things could be the same ultimately resolves itself into an empirical question involving observable factors and weighing them in various combinations to determine the closest related continuer. The question of "sameness" is not an a priori question at all; apart from being a logical impossibility, it is an empirical question. For us, no two "studies" are the same. All studies differ and the only interesting questions to ask about them concern how they vary across the factors we conceive of as important. This notion is not fully developed here and I will return to it later.
The "Flat Earth" Criticism
I may not be the best person to critique meta-analysis, for obvious reasons. However, I will cop to legitimate criticisms of the approach when I see them, and I haven't seen many. But one criticism rings true because I knew at the time that I was being forced into a position with which I wasn't comfortable. Permit me to return to the psychotherapy meta-analysis.
Eysenck was, as I have said, a nettlesome critic of the psychotherapy establishment in the 1960s and 1970s. His exaggerated and inflammatory statements about psychotherapy being worthless (no better than a placebo) were not believed by psychotherapists or researchers, but they were not being effectively rebutted either. Instead of taking him head-on, as my colleagues and I attempted to do, researchers, like Gordon Paul, for example, attempted to argue that the question whether psychotherapy was effective was fundamentally meaningless. Rather, asserted Paul while many others assented, the only legitimate research question was "What type of therapy, with what type of client, produces what kind of effect?" I confess that I found this distracting dodge as frustrating as I found Eysenck's blanket condemnation. Here was a critic—Eysenck—saying that all psychotherapists are either frauds or gullible, self-deluded incompetents, and the establishment's response is to assert that he is not making a meaningful claim. Well, he was making a meaningful claim; and I already knew enough from the meta-analysis of the outcome studies to know that Paul's question was unanswerable due to insufficient data, and that reseacrhers were showing almost no interest in collecting the kind of data that Paul and others argued were the only meaningful data.
It fell to me, I thought, to argue that the general question "Is psychotherapy effective?" is meaningful and that psychotherapy is effective. Such generalizations—across types of therapy, types of client and types of outcome—are meaningful to many people—policy makers, average citizens—if not to psychotherapy researchers or psychotherapists themselves. It was not that I necessarily believed that different therapies did not have different effects for different kinds of people; rather, I felt certain that the available evidence, tons of it, did not establish with any degree of confidence what these differential effects were. It was safe to say that in general psychotherapy works on many things for most people, but it was impossible to argue that this therapy was better than that therapy for this kind of problem. (I might add that twenty years after the publication of The Benefits of Psychotherapy, I still have not seen compelling answers to Paul's questions, nor is their evidence of researchers having any interest in answering them.)
The circumstances of the debate, then, put me in the position of arguing, circa 1980, that there are very few differences among various ways of treating human beings and that, at least, there is scarcely any convincing experimental evidence to back up claims of differential effects. And that policy makers and others hardly need to waste their time asking such questions or looking for the answers. Psychotherapy works; all types of therapy work about equally well; support any of them with your tax dollars or your insurance policies. Class size reductions work—very gradually at first (from 30 to 25 say) but more impressively later (from 15 to 10); they work equally for all grades, all subjects, all types of student. Reduce class sizes, and it doesn't matter where or for whom.
Well, one of my most respected colleagues called me to task for this way of thinking and using social science research. In a beautiful and important paper entitled "Prudent Aspirations for the Social Sciences," Lee Cronbach chastised his profession for promising too much and chastised me for expecting too little. He lumped me with a small group of like-minded souls into what he named the "Flat Earth Society," i.e., a group of people who believe that the terrain that social scientists explore is featureless, flat, with no interesting interactions or topography. All therapies work equally well; all tests predict success to about the same degree; etc.:
"...some of our colleagues are beginning to sound like a kind of Flat Earth Society. They tell us that the world is essentially simple: most social phenomena are adequately described by linear relations; one-parameter scaling can discover coherent variables independent of culture and population; and inconsistences among studies of the same kind will vanish if we but amalgamate a sufficient number of studies.... The Flat Earth folk seek to bury any complex hypothesis with an empirical bulldozer." (Cronbach, 1982, p. 70.)Cronbach's criticism stung because it was on target. In attempting to refute Eysenck's outlandishness without endorsing the psychotherapy establishment's obfuscation, I had taken a position of condescending simplicity. A meta-analysis will give you the BIG FACT, I said; don't ask for more sophisticated answers; they aren't there. My own work tended to take this form, and much of what has ensued in the past 25 years has regrettably followed suit. Effect sizes—if it is experiments that are at issue—are calculated, classified in a few ways, perhaps, and all their variability is then averaged across. Little effort is invested in trying to plot the complex, variegated landscape that most likely underlies our crude averages.
Consider an example that may help illuminate these matters. Perhaps the most controversial conclusion from the psychotherapy meta-analysis that my colleagues and I published in 1980 was that there was no evidence favoring behavioral psychotherapies over non-behavioral psychotherapies. This finding was vilified by the behavioral therapy camp and praised by the Rogerians and Freudians. Some years later, prodded by Cronbach's criticism, I returned to the database and dug a little deeper. What I found appears in Figure 2. When the nine experiments extant in 1979—and I would be surprised if there are many more now—in which behavioral and non-behavioral psychotherapies are compared in the same experiment between randomized groups and the effects of treatment are plotted as a function of follow-up time, the two curves in Figure 2 result. The findings are quite extraordinary and suggestive. Behavioral therapies produce large short-term effects which decay in strength over the first year of follow-up; non- behavioral therapies produce initially smaller effects which increase over time. The two curves appear to be converging on the same long-term effect. I leave it to the reader to imagine why. One answer, I suspect, is not arcane and is quite plausible.
In the twenty-five years between the first appearance of the word "meta-analysis" in print and today, there have been several attempts to modify the approach, or advance alternatives to it, or extend the method to reach auxiliary issues. If I may be so cruel, few of efforts have added much. One of the hardest things to abide in following the developments in meta-analysis methods in the past couple of decades was the frequent observation that what I had contributed to the problem of research synthesis was the idea of dividing mean differences by standard deviations. "Effect sizes," as they are called, had been around for decades before I opened my first statistics text. Having to read that "Glass has proposed integrating studies by dividing mean differences by standard deviations and averaging them" was a bitter pill to swallow. Some of the earliest work that I and my colleagues did involved using a variety of outcome measures to be analyzed and synthesized: correlations, regression coefficients, proportions, odds ratios. Well, so be it; better to be mentioned in any favorable light than not to be remembered at all.
After all, this was not as hard to take as newly minted confections such as "best evidence research synthesis," a come-lately contribution that added nothing whatsoever to what myself and many others had been saying repeatedly on the question of whether meta- analyses should use all studies or only "good" studies. I remain staunchly committed to the idea that meta-analyses must deal with all studies, good bad and indifferent, and that their results are only properly understood in the context of each other, not after having been censored by some a priori set of prejudices. An effect size of 1.50 for 20 studies employing randomized groups has a whole different meaning when 50 studies using matching show an average effect of 1.40 than if 50 matched groups studies show an effect of -.50, for example.
The appropriate role for inferential statistics in meta-analysis is not merely unclear, it has been seen quite differently by different methodologists in the 25 years since meta- analysis appeared. In 1981, in the first extended discussion of the topic (Glass, McGaw and Smith, 1981), I raised doubts about the applicability of inferential statistics in meta- analysis. Inference at the level of persons within seemed quite unnecessary, since even a modest size synthesis will involve a few hundred persons (nested within studies) and lead to nearly automatic rejection of null hypotheses. Moreover, the chances are remote that the persons or subjects within studies were drawn from defined populations with anything even remotely resembling probabilistic techniques. Hence, probabilistic calculations advanced as if subjects had been randomly selected would be dubious. At the level of "studies," the question of the appropriateness of inferential statistics can be posed again, and the answer again seems to be negative. There are two instances in which common inferential methods are clearly appropriate, not just in mata-analysis but in any research: 1) when a well defined population has been randomly sampled, and 2) when subjects have been randomly assigned to conditions in a controlled experiment. In the latter case, Fisher showed how the permutation test can be used to make inferences to the universe of all possible permutations. But this case is of little interest to meta-analysts who never assign units to treatments. Moreover, the typical meta-analysis virtually never meets the condition of probabilistic sampling of a population (though in one instance (Smith, Glass & Miller, 1980), the available population of psychoactive drug treatment experiments was so large that a random sample of experiments was in fact drawn for the meta- analysis). Inferential statistics has little role to play in meta-analysis
It is common to acknowledge, in meta-analysis and elsewhere, that many data sets fail to meet probabilistic sampling conditions, and then to argue that one ought to treat the data in hand "as if" it were a random sample of some hypothetical population. One must be wary here of the slide from "hypothesis about a population" into "a hypothetical population." They are quite different things, the former being standard and unobjectionable, the latter being a figment with which we hardly know how to deal. Under this stipulation that one is making inferences not to some defined or known population but a hypothetical one, inferential techniques are applied and the results inspected. The direction taken mirrors some of the earliest published opinion on this problem in the context of research synthesis, expressed, for example, by Mosteller and his colleagues in 1977: "One might expect that if our MEDLARS approach were perfect and produced all the papers we would have a census rather than a sample of the papers. To adopt this model would be to misunderstand our purpose. We think of a process producing these research studies through time, and we think of our sample—even if it were a census—as a sample in time from the process. Thus, our inference would still be to the general process, even if we did have all appropriate papers from a time period." (Gilbert, McPeek and Mosteller, 1977, p. 127; quoted in Cook et al., 1992, p. 291) This position is repeated in slightly different language by Larry Hedges in Chapter 3 "Statistical Considerations" of the Handbook of Research Synthesis (1994): "The universe is the hypothetical collection of studies that could be conducted in principle and about which we wish to generalize. The study sample is the ensemble of studies that are used in the review and that provide the effect size data used in the research synthesis." (p. 30)
These notions appear to be circular. If the sample is fixed and the population is allowed to be hypothetical, then surely the data analyst will imagine a population that resembles the sample of data. If I show you a handful of red and green M&Ms, you will naturally assume that I have just drawn my hand out of a bowl of mostly red and green M&Ms, not red and green and brown and yellow ones. Hence, all of these "hypothetical populations" will be merely reflections of the samples in hand and there will be no need for inferential statistics. Or put another way, if the population of inference is not defined by considerations separate from the characterization of the sample, then the population is merely a large version of the sample. With what confidence is one able to generalize the character of this sample to a population that looks like a big version of the sample? Well, with a great deal of confidence, obviously. But then, the population is nothing but the sample writ large and we really know nothing more than what the sample tells us in spite of the fact that we have attached misleadingly precise probability numbers to the result.
Hedges and Olkin (1985) have developed inferential techniques that ignore the pro forma testing (because of large N) of null hypotheses and focus on the estimation of regression functions that estimate effects at different levels of study. They worry about both sources of statistical instability: that arising from persons within studies and that which arises from variation between studies. The techniques they present are based on traditional assumptions of random sampling and independence. It is, of course, unclear to me precisely how the validity of their methods are compromised by failure to achieve probabilistic sampling of persons and studies.
The irony of traditional hypothesis testing approaches applied to meta-analysis is that whereas consideration of sampling error at the level of persons always leads to a pro forma rejection of "null hypotheses" (of zero correlation or zero average effect size), consideration of sampling error at the level of study characteristics (the study, not the person as the unit of analysis) leads to too few rejections (too many Type II errors, one might say). Hedges's homogeneity test of the hypothesis that all studies in a group estimate the same population parameter frequently seen in published meta-analyses these days. Once a hypothesis of homogeneity is accepted by Hedges's test, one is advised to treat all studies within the ensemble as the same. Experienced data analysts know, however, that there is typically a good deal of meaningful covariation between study characteristics and study findings even within ensembles where Hedges's test can not reject the homogeneity hypothesis. The situation is parallel to the experience of psychometricians discovering that they could easily interpret several more common factors than inferential solutions (maximum- likelihood; LISREL) could confirm. The best data exploration and discovery are more complex and convincing than the most exact inferential test. In short, classical statistics seems not able to reproduce the complex cognitive processes that are commonly applied with success by data analysts.
Donald Rubin (1990) addressed some of these issues squarely and articulated a position that I find very appealing : "...consider the idea that sampling and representativeness of the studies in a meta-analysis are important. I will claim that this is nonsense—we don't have to worry about representing a population but rather about other far more important things." (p. 155) These more important things to Rubin are the estimation of treatment effects under a set of standard or ideal study conditions. This process, as he outlined it, involves the fitting of response surfaces (a form of quantitative model building) between study effects (Y) and study conditions (X, W, Z etc.). I would only add to Rubin's statement that we are interested in not merely the response of the system under ideal study conditions but under many conditions having nothing to do with an ideally designed study, e.g., person characteristics, follow-up times and the like.
By far most meta-analyses are undertaken in pursuit not of scientific theory but technological evaluation. The evaluation question is never whether some hypothesis or model is accepted or rejected but rather how "outputs" or "benefits" or "effect sizes" vary from one set of circumstances to another; and the meta-analysis rarely works on a collection of data that can sensibly be described as a probability sample from anything.
If our efforts to research and improve education are to prosper, meta-analysis will have to be replaced by more useful and more accurate ways of synthesizing research findings. To catch a glimpse of what this future for research integration might look like, we need to look back at the deficiencies in our research customs that produced meta-analysis in the first place.
First, the high cost in the past of publishing research results led to cryptic reporting styles that discarded most of the useful information that research revealed. To encapsulate complex relationships in statements like "significant at the .05 level" was a travesty—a travesty that continues today out of bad habit and bureaucratic inertia.
Second, we need to stop thinking of ourselves as scientists testing grand theories, and face the fact that we are technicians collecting and collating information, often in quantitative forms. Paul Meehl (1967; 1978) dispelled once and for all the misconception that we in, what he called, the "soft social sciences" are testing theories in any way even remotely resembling how theory focuses and advances research in the hard sciences. Indeed, the mistaken notion that we are theory driven has, in Meehl's opinion, led us into a worthless pro forma ritual of testing and rejecting statistical hypotheses that are a priori known to be 99% false before they are tested.
Third, the conception of our work that held that "studies" are the basic, fundamental unit of a research program may be the single most counterproductive influence of all. This idea that we design a "study," and that a study culminates in the test of a hypothesis and that a hypothesis comes from a theory—this idea has done more to retard progress in educational research than any other single notion. Ask an educational researcher what he or she is up to, and they will reply that they are "doing a study," or "designing a study," or "writing up a study" for publication. Ask a physicist what's up and you'll never hear the word "study." (In fact, if one goes to http://xxx.lanl.gov where physicists archive their work, one will seldom see the word "study." Rather, physicists—the data gathering experimental ones—report data, all of it, that they have collected under conditions that they carefully described. They contrive interesting conditions that can be precisely described and then they report the resulting observations.)
Meta-analysis was created out of the need to extract useful information from the cryptic records of inferential data analyses in the abbreviated reports of research in journals and other printed sources. "What does this t-test really say about the efficacy of ritalin in comparison to caffeine?" Meta-analysis needs to be replaced by archives of raw data that permit the construction of complex data landscapes that depict the relationships among independent, dependent and mediating variables. We wish to be able to answer the question, "What is the response of males ages 5-8 to ritalin at these dosage levels on attention, acting out and academic achievement after one, three, six and twelve months of treatment?"
We can move toward this vision of useful synthesized archives of research now if we simply re-orient our ideas about what we are doing when we do research. We are not testing grand theories, rather we are charting dosage-response curves for technological interventions under a variety of circumstances. We are not informing colleagues that our straw-person null hypothesis has been rejected at the .01 level, rather we are sharing data collected and reported according to some commonly accepted protocols. We aren't publishing "studies," rather we are contributing to data archives.
Five years ago, this vision of how research should be reported and shared seemed hopelessly quixotic. Now it seems easily attainable. The difference is the I-word: the Internet. In 1993, spurred by the ludicrously high costs and glacial turn-around times of traditional scholarly journals, I created an internet-based peer-reviewed journal on education policy analysis (http://epaa.asu.edu). This journal, named Education Policy Analysis Archives, is now in its seventh year of publication, has published 150 articles, is accessed daily without cost by nearly 1,000 persons (the other three paper journals in this field have average total subscription bases of fewer than 1,000 persons), and has an average "lag" from submission to publication of about three weeks. Moreover, we have just this year started accepting articles in both English and Spanish. And all of this has been accomplished without funds other than the time I put into it as part of my normal job: no secretaries, no graduate assistants, nothing but a day or two a week of my time.
Two years ago, we adopted the policy that any one publishing a quantitative study in the journal would have to agree to archive all the raw data at the journal website so that the data could be downloaded by any reader. Our authors have done so with enthusiasm. I think that you can see how this capability puts an entirely new face on the problem of how we integrate research findings: no more inaccurate conversions of inferential test statistics into something worth knowing like an effect size or a correlation coefficient or an odds ratio; no more speculating about distribution shapes; no more frustration at not knowing what violence has been committed when linear coefficients mask curvilinear relationships. Now we simply download each others' data, and the synthesis prize goes to the person who best assembles the pieces of the jigsaw puzzle into a coherent picture of how the variables relate to each other.
Cook, T.D. Meta-analysis for explanation — a casebook. New York: Russell Sage Foundation; 1992.
Cooper, H.M. (1989). Integrating research: a guide for literature reviews. 2nd ed. Newbury Park, CA: SAGE Publications.
Cooper, H.M. and Hedges, L. V. (Eds.) (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cronbach, L.J. (1982). Prudent Aspirations for Social Inquiry. Chapter 5 (Pp. 61-81) in Kruskal, W.H. (Ed.), The social sciences: Their nature and uses. Chicago: The University of Chicago Press.
Eysenck, H.J. (1965). The effects of psychotherapy. International Journal of Psychiatry, 1, 97-178.
Glass, G. V (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.
Glass, G.V (1978). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-379.
Glass, G. V, McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: SAGE Publications.
Glass, G.V et al. (1982). School class size: Research and policy. Beverly Hills, CA: SAGE Publications.
Glass, G.V and Robbins, M.P. (1967). A critique of experiments on the role of neurological organization in reading performance. Reading Research Quarterly, 3, 5-51.
Hedges, L. V., Laine, R. D., & Greenwald, R. (1994). Does Money Matter? A Meta- Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes. Educational Researcher, 23(3): 5-14.
Hedges, L.V. and Olkin, I. Statistical methods for meta-analysis. New York: Academic Press; 1985.
Hunt, M. (1997). How science takes stock: The story of meta-analysis. NY: Russell Sage Foundation.
Hunter, J.E. & Schmidt, F.L. (1990). Methods of meta-analysis: correcting error and bias in research findings. Newbury Park (CA): SAGE Publications.
Hunter, J.E., Schmidt, F.L. & Jackson, G.B. (1982). Meta-analysis: cumulating research findings across studies. Beverly Hills, CA: SAGE Publications.
Light, R. J., Singer, J. D., & Willett, J. B. (1990). By design: Planning research on higher education. Cambridge, MA: Harvard University Press.
Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-15.
Meehl. P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-34.
Rosenthal, R. (1976). Experimenter effects in behavioral research. New York: John Wiley.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Rev. ed. Newbury Park (CA): SAGE Publications.
Rubin, D. (1990). A new perspective. Chp. 14 (pp. 155-166) in Wachter, K. W. & Straf, M. L. (Eds.). The future of meta-analysis. New York: Russell Sage Foundation.
Smith, M.L. and Glass, G.V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-60.
Smith, M.L., Glass, G.V and Miller, T.I. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.
Underwood, B.J. Interference and forgetting. Psychological Review, 64(1), 49–60.
Wachter, K.W. and Straf, M.L., (Editors). (1990). The future of meta-analysis. New York: Russell Sage Foundation.
Wolf, F.M. (1986). Meta-analysis: quantitative methods for research synthesis. Beverly Hills, CA: SAGE Publications.