Gene V Glass: April 2020

The following is an edited log of an asynchronous online discussion of the Tennessee Value Added Assessment System (TVAAS) for assessing teachers. This discussion took place in late 1994 and early 1995 on an internet LISTSERV known as EDPOLYAN, which was housed at the Arizona State University College of Education.

The TVAAS was one of the first techniques proposed for measuring the value that teachers added to students' learning, namely, VAM systems. "Value," in this context was taken to mean pre-test to post-test gains in class averages on standardized tests. TVAAS was developed by an agricultural statistician at the University of Tennessee, William L. Sanders, and first published in a little known journal in 1994, just months before this discussion took place. Sanders occasionally participated in the discussion that follows, but it will soon become apparent that he was not happy with the direction the discussion took. He died in 2010.

The log of the discussion was edited in an attempt to add clarity and reduce redundancy. Each participant in the discussion was given the opportunity to review the edited log and make corrections.

The reader may want to know that TVAAS and VAM more generally became the object of a great deal of attention from the political world and from academics studying statistics and teacher evaluation. VAM made appearances in the accountability plans of several states during the Obama administration -- largely because it had won the approval of Secretary of Education Arnie Duncan. Post-Obama, VAM made a few appearances in court, where decisions largely enjoined its use. Suffice it to say, by 2020 VAM is little used and has few friends.

Anyone with a continuing interest in VAM will learn much from Audrey Amrein-Beardsley's blog Vamboozled.

===============================================
Date: Fri, 9 Sep 1994 12:57:52 EDT
From: Scriven@AOL.COM

"Do you really want to see teaching become an even higher-turnover profession?" No, and it's because of the bad teacher evaluation systems now in use that we lose many good teachers, so I'm trying to improve the evaluation systems. Piece work is not the best way; it's just a counter-example to the idea one can't use individual rewards.

I do include the TN Value Added System as one of the new and promising efforts, and I'd like to hear what you see as its flaws.

The reason for rewarding teachers as individuals is that nearly all of them work as individuals; the Deming approach is fine for situations where you only have groups as the work unit.

----------------------------------------------

Date: Fri, 9 Sep 1994 23:17:35 -0500
From: SHERMAN DORN

For those of you unaware of the Tennessee Value Added Assessment system (mandated by state law), I'll try to describe it briefly: you take the standard scores from the state normative test for each child, subtract from it the standard score for the child from the previous year, and use the gain score (essentially a putative change up or down a normal curve) in a statistical equation with district, school, and teacher as explanatory variables. (For statisticians out there, I know Bill Sanders, the agronomist who created VAA [I guess he got the idea from measuring corn yields or something similar], uses a mixed-model equation. I don't know, however, which are the fixed-effect variables and which the random-effect variables.)

The problems with VAA? Well, I can think of a few:

a. The intrinsic worth of the standardized tests are questionable, at best. For test security, test questions and structure are confidential. Which means we can't judge what, in fact, the things are measuring. Not to mention the usual shenanigans that occur in schools with high-stakes tests.
b. The tests for different grades (first, second, etc.) have different norming populations that are not equivalent: some people are retained in every grade, there's some differential migration, more kids in higher grades are certified in special ed (and thus likely excluded from the norming population) -- with the end result that a standard score of 500 in first grade (putatively at the mean of a normal curve) cannot mean the same thing for ANY OTHER GRADE. Ergo, subtracting standard scores is patent nonsense. (Essentially, having noncomparable norming populations does two things: shifts the mean around, and changes the standard deviation, but in ways we just can't know without having access to the original norming group.)
c. A gain score is a questionable basis for statistical analysis.
d. Since the standardized tests are taken at least one, and in some years almost two, months before the end of the year, gain scores conflate the effects of two different teachers.
e. VAA may seriously underestimate the effects of prior knowledge, social background, etc. I know of no analysis to date of the implications of model misspecification for VAA -- as far as I know, Sanders just plugs in last year's scores, this year's scores, the districts, schools, and teachers, and lets the program rip, without reference to race, sex, gender, or even the possibility of a nonlinear relationship between last year's score and this year's. Would the effects be different if you put in sex, race, economic class, perhaps a square of last year's scores, in the equation? I bet no one knows.
f. VAA is not an evaluation system accessible to teacher understanding. It is as abstruse and anxiety-provoking as anything one could imagine. An incentive for better teaching? Absolutely not, from my experience working with teachers. An incentive for absolute panic? You bet.

The reason for rewarding teachers as individuals is that nearly all of them work as individuals; the Deming approach is fine for situations where you only have groups as the work unit.

The question is not whether people work as individuals or groups, but whether the *work* is individual or group. And, unless I'm wrong, the socialization and education of children is not done by an individual but by lots of people. Is the third-grader who reads poorly the fault of the third-grade teacher? Perhaps a bit, for plenty of third-grade teachers will refuse to teach a child who cannot read. But it is certainly the fault of prior teachers. Same for a high-achieving child in a subject.

And the status quo of individual teaching does not justify the use of individual incentives; we need to compare individual merit-pay schemes with more collective incentive systems combined with a change in teaching structure. The fact that teachers do not work together now very often is not a justification for disincentives from working together forever.

-----------------------------

Date: Thu, 27 Oct 1994 13:22:16 LCL
From: "William L. Sanders"

Last week, a response to one of Scriven's comments by Dorn (9- 9-94) was called to my attention.

I am writing in response to this totally erroneous and distortive description of the Tennessee Value-Added Assessment System (TVAAS). Never in my 28+ years working in the academic arena have I read or seen such a blatant misrepresentation of the truth--a misrepresentation which, I hope, is based upon a lack of knowledge instead of a deliberate attempt to sabotage. I want to response to this gross misrepresentation point by point.

First, I am not an agronomist; I am a statistician. For many years, I have had the responsibility for the Statistical and Computing Services Unit within the Agricultural Experiment Station. The University of Tennessee being the Land-Grant university of Tennessee, not unlike Iowa State or Purdue or NC State, etc. has supported statistical consulting and research through its Experiment Station for many years. I am also an Adjunct Professor in the Department of Statistics which at UTK is in the College of Business. I let my professional record and research speak for itself. My research interests are and have been in the areas of: 1. experimental design and 2. statistical mixed models. I came into the educational research arena in the early '80s quite by accident; but due to an apparent need to solve many of the statistical problems which educational types were citing, this has become in the past 5-6 years my primary focus.

What is so distressing about Dorn's comments is the blatant lack of knowledge about our work and methodology. He claims to know me; as I write this, I would not know Sherman Dorn if he walked into my office. I have made over 200 presentations conceptualizing this approach; if I met him during and after one of those presentation, I certainly do not remember it. If he is so committed to 'trash' this approach at least he should have the common decency to make accurate and objective academic arguments without relying on petty slurs and innuendos.

Second, his attempt to describe the modeling approaches are totally inaccurate. We do not even calculate simple gains. For example, we use the whole observation vector for each child over all subjects and grades. In fact, we have a 'small' article later presented at an American Statistical Association meeting that demonstrates how this approach is superior to traditional multivariate approaches in that the whole observational vector is not lost due to one missing value for a variable.

But the most telling comment of his ignorance of the TVAAS process was when he indicated that he did not know what was 'random' in the model. The process is built on the formulation of the late C.R. Henderson, a Cornell animal breeder who was named a fellow in the American Statistical Association for his pioneering work in this area. Henderson's development of the concept of best linear unbiased prediction (BLUP), has been shown to be related to other shrinkage estimator concepts by David Harville at Iowa State and others (i.e. many bayesian concepts, kalman filtering from the engineering sciences, some of the hierarchical linear model concepts, etc.) . However, we have found that Henderson's formulations have tremendous computing advantages over some other equivalent alternatives. In fact, while I was a consultant to SAS Institute, Inc., as they were planning and implementing their rather new MIXED procedure, I strongly recommended that they use this formulation in their development and manuals -- a recommendation which was accepted.

As we apply these approaches in the context of the estimation of the teacher and school effects on the academic growth of populations of students, we take advantage of the prior knowledge of the distribution of the variance-covariance structure among populations of teachers, as well as the variance-covariance structure among students. By so doing, we solve the following problems:

2. teachers changing assignments
3. modes of instruction (team teaching, departmental instruction, self contained classrooms)
4. use of dissimilar indicator variables over time.
5. combinations of different quantities and qualities of information
etc., etc., etc., etc., .......

Dorn indicated that "since the standardized tests are taken at least one, in some years almost two, months before the end of the year, gain scores conflate the effects of two different teachers". This is another example of his lack of knowledge of the process. We have developed a process which we call the 'stacked block concept' which enables the partitioning of these effects. This concept is not totally dissimilar to the recovery of interblock information from incomplete block designs.

I invite any of you, including Dorn, to call and we will supply you with as much detail and information as you wish.

As to his concern with regard to model specification, we have done a tremendous amount of computer simulation to evaluate just that concern, and we can demonstrate to any reasonable person that this process is very robust against rather severe mispecification.

We welcome legitimate and qualified criticism. For the past four years, we have developed and implemented this process for the whole state of Tennessee. Presently, the data base contains over 3 million records longitudinally merged for the entire state's student population. To provide these estimates for all schools (and next year all teachers grades 3-8 for the entire state), the solution to tens of thousands of equations is required. We are doing this computing on a dedicated RS-6000 workstation. Because of this work crunch, our writing has lagged; however, we do have a sufficient amount of written material to enable informed comment. Informed comment, we welcome; hatchet jobs like Dorn just pulled, have no place on the internet. I describe his recent message on the board as worst than political dirty tricks. That kind of activity is not consistent with the academic reputation and tradition of one of Tennessee's treasures, Vanderbilt University.

======================================================

Date: Fri, 28 Oct 1994 07:41:07 -0500
From: SHERMAN DORN

I will reply at some length to Prof. Sanders' comments later, but just a few points here:

1. I have searched in several places for peer-reviewed descriptions of the Tennesee Value-Added Assessment System (TVAAS), as well as looking at the text of the legislation in Tennessee Code Annotated. Until a member of the staff at the center producing the assessments told me last week of an article in the *most recent* Journal of Personnel Evaluation in Education, I have been unable to locate a single article about it through ERIC or the other periodical indices I have searched. What I described in September was the result of information I had at the time -- which, by the way, was more than what teachers and principals are given to understand about TVAAS. The legislation was passed several years ago; the fact that I have been unable to locate peer-reviewed articles on the subject using abstract indices in 1994 is not evidence of laziness or misrepresentation on my part, I believe.

(It is not evidence of either on the part of TVAAS center staff, either, just not evidence about my actions.)

2. I stated very clearly that I *knew* of no attempts to gauge the results of model misspecification and that I thought it would result in wide discrepancies -- and by this I mean discrepancies in the rankings given schools. My question about what are the random or fixed effects stands: the fact that the TVAAS uses a mixed-model methodology which SAS now uses does not tell us whether the specific models used in calculating TVAAS make sense in the real world. I know SAS has that model, and I have read the article in American Statistician which Sanders co-authored and which is cited in the legislation as criteria for acceptable statistical models. The fact that Sanders' mixed-model methodology is a good general approach to mixed models (where the effects of some things are fixed and others are random, with the possibility of interactions among the levels) is irrelevant to the question of whether the model MEANS anything.

3. Prof. Sanders' statistical expertise does not answer the policy-related questions which I raised at the time -- or the additional ones I have now.

Quickly re-reading my comments in September, I regret implying that Sanders' expertise was in agronomy and not in statistics only. I will confirm that Sanders left me a message several days ago, and I have sent an e-mail message explaining why I have not returned his call. (I did so yesterday, before I saw his message. The two may have passed ... rather, the fact that we did not see the other's message does not imply anything except that the Internet is not instantaneous.)

(Yeesh, please mentally delete "only" from "in statistics only." I forgot to use the "/edit" command for this, and can't re-edit lines above.)

As mentioned above, a TVAAS staff person has pointed out to me a recent article, and I will wait until I receive it and read it before I comment further. I have also recently become aware of Sanders' presentation to the CREATE folks at Western Michigan State last year and (according to one educator I know) of presentations about TVAAS this summer. Unfortunately, the CREATE gopher site at Western Michigan is not yet up-and-running, and the campus' gopher does not have a working faculty directory. (CREATE is one of the federally-sponsored research centers for education, and although I forget what the acronym stands for, it is about assessment and evaluation.)

====================================================

Date: Fri, 28 Oct 1994 12:02:53 MST
From: Gene Glass

Dear Professor Sanders:
I like statistics; I made the better part of my living off of it for many years. But could we set it aside for just a minute while you answer a question or two for me?

Although I have read little about the Tenn Value Added Assessment system, I gather that it is a means of measuring what it is that a particular teacher contributes to the basic skills learning of a class of students. Let me stipulate for the moment that for your sake all of the purely statistical considerations attendant to partialling out previous contributions of other teachers' "additions of value" to this year's teachers' addition of value have been resolved perfectly--above reproach; no statistician who understands mixed models, covariance adjustment and the like would question them. Let's just pretend that this is true.

Now imagine--and it should be no strain on one's imagination to do so-- that we have Teacher A and Teacher B and each has had the pretest (Sept) achievement status of their students impecabbly measured. But A has a class with average IQ of 115 and B has a class of average IQ 90. Let's suppose that A and B teach to the very limit of their abilities all year long and that in the eyes of God, they are equally talented teachers. We would surely expect that A's students will achieve much more on the posttest (June) than B's. Anyone would assume so; indeed, we would be shocked if it were not so.

Question: How does your system of measuring and adjusting and assigning numbers to teachers take these circumstances into account so that A and B emerge with equal "added value" ratings?

====================================================

Date: Sat, 10 Dec 1994 20:55:23 CST
From: William Sanders

The process is built on the formulation of the late C.R. Henderson, a Cornell animal breeder who was named a fellow in the American Statistical Association for his pioneering work in this area. Henderson's development of the concept of best linear unbiased prediction (BLUP), has been shown to be related to other shrinkage estimator concepts by David Harville at Iowa State and others (i.e. many bayesian concepts, kalman filtering from the engineering sciences, some of the hierarchical linear model concepts, etc.) . However, we have found that Henderson's formulations have tremendous computing advantages over some other equivalent alternatives. In fact, while I was a consultant to SAS Institute, Inc., as they were planning and implementing their rather new MIXED procedure, I strongly recommended that they use this formulation in their development and manuals -- a recommendation which was accepted.

1. fractured student records
2. teachers changing assignments
3. modes of instruction (team teaching, departmental instruction, self contained classrooms)
4. use of dissimilar indicator variables over time.
5. combinations of different quantities and qualities of information
etc., etc., etc., etc., .......

We do not calculate simple gains. For example, we use the whole observation vector for each child over all subjects and grades. In fact, we have a 'small' article later presented at an American Statistical Association meeting that demonstrates how this approach is superior to traditional multivariate approaches in that the whole observational vector is not lost due to one missing value for a variable.

{Some have worried} that "since the standardized tests are taken at least one, in some years almost two, months before the end of the year, gain scores conflate the effects of two different teachers". We have developed a process which we call the 'stacked block concept' which enables the partitioning of these effects. This concept is not totally dissimilar to the recovery of interblock information from incomplete block designs.

We have done a tremendous amount of computer simulation to evaluate just that concern, and we can demonstrate that this process is very robust against rather severe mispecification.

Question: But is there any way that you or someone on your staff can explain in some sort of intuitive way how the numbers you are crunching have anything to do with children's learning, and teachers' teaching -- and how you can know that for certain. I am not a statistician, but I do know something about the relationship between math and the physical world, and I know that just manipulating formuli will not always produce results that reflect what occurs in the physical world. How can you know that what you are measuring is what you think you are measuring? Especially since it seems unlikely we can directly test your results to see whether you have "it" right; since there is no "it" apart from your results that we can directly observe as verification.

Answer and further introduction: The Tennessee Value-Added Assessment system was developed on the basis of three assumptions: Fair and valid assessment is necessary to program improvement; such assessment models can be constructed; and it is reasonable to expect that children grow academically at a rate commensurate with their peers if effective instruction is taking place. None of these assumptions is original with us, but the way TVAAS addresses them is new.

Because it uses metrics that have long been suspect, at least in part due to the fact that previous methodology was incapable of rendering the necessary degree o fairness and validity, TVAAS is likewise looked upon with suspicion. That's to be expected. However, TVAAS in not plagued by the flaws of previous models, and it may be that a real paradigm shift is in order here. This is why:

1. TVAAS utilizes the scaled gain scores students make over time in order to model their learning patterns. In this way, it is possible to note when the normal pace of academic growth deviates. Obviously, this cannot be accomplished by figuring simple gains for each child. TVAAS uses the entire observational vector for each child across time. By using all the available data, TVAAS estimates the effects of educational entities without having to calculate the gain for individual kids. The records of students with fewer data points are weighted more lightly than those of students with complete sets in the calculation of educational effects, but the data of students with incomplete sets are not ignored.
The advantage of following growth over time is that the child serves as his or her own "control." Ability, race, and many other factors that have been impossible to partition from educational effects in the past are stable throughout the life of the child. By taking the child where we find her; by aggregating the gains of cohorts of students in school systems, schools, and classrooms; and by constructing covariance matrices among teachers in a school, schools in a system, and systems in a state, we can fairly and reliably attribute educational effects.
2. The scaled gain scores are derived from the norm-referenced items (the CTBS/4) of the Tennessee Comprehensive Assessment Program (TCAP). TCAP is administered state wide to all students in grades two through eight and ten. Scores from science, math, social studies, language arts, and reading furnish the data for TVAAS. The Education Improvement Act, the legislative package which identifies TVAAS as the model for educational accountability in Tennessee, mandates that it derive its data from "fresh, non-redundant, equivalent tests," insuring that "teaching to the test" is minimized and perhaps eventually extinguished due to lack of effectiveness.
Of course, there is a very vocal group that excoriates any use at all of standardized test data. We do not contend that the CTBS/4--or any other assessment tool, standardized or otherwise--can accurately depict the totality of a child's learning experience. However, we believe that the information it does provide is valuable. On the other hand, if better measures are found, TVAAS can easily utilize them in addition to or instead of the CTBS/4, so long as they provide linear measures with appropriate statistical properties.
3. TVAAS does not prescribe any particular model for effective instruction. Teachers are free to teach as they see fit. As long as their students make appropriate gains, teachers can assume that they are meeting the needs of their pupils, regardless of the method they choose.
4. TVAAS encourages appropriate instruction for all students. We all know that students enter the classroom with various levels of preparedness. It is not expected that all students will perform at the same level of competence or will achieve the same outcomes from a year of instruction. It is assumed, however, that each child will achieve the normed gain for his or her age and grade. Four pilot studies and three years of state wide reports have verified that this is a reasonable assumption since gain has been shown to be unrelated to level of ability, racial make-up of the student body, or socio-economic indicators such as percentage of students who receive reduced-price and free lunches. What this means is that if each child is taught according to his or her ability and level of preparedness, normal gains are to be expected. We think this attention to the needs of the individual child is crucial. Schools, systems, and teachers who do best under TVAAS are those who address the needs of all their students.
5. TVAAS can and does furnish far more than the sterile reports schools are used to receiving as a result of standardized tests. Each school in Tennessee will receive, beginning this year, a report broken down by student achievement level, indicating the gains of students in four to five divisions, ranging from low to high. Average gains for each achievement level in each tested subject and grade will be provided. This allows schools to pinpoint areas in which they are doing well and areas that need attention. For instance, a pattern of high gains for low achievers and low gains for high achievers might suggest that a school was doing well by its slower students but might not be offering enough enrichment or accelerated classes for its high achievers. Schools can also assess their programs in specific subject areas and grades. All of this is essential to effective program improvement.
6. An outgrowth of TVAAS is an enormous data base of educational information. Currently, there are over three million pieces of merged data in the data base. This unique resource allows for an unprecedented perspective on educational phenomena. It is already yielding significant findings and promises to continue to enlighten research on a tremendous range of educational issues for many years to come. The University of Tennessee Value-Added Research and Assessment Center hopes to form research collaboratives to investigate the findings deriving from the TVAAS data base as well as educational research questions originating from other sources.

{We wish to} encourage you to think about collaborative research projects. I don't think I'm stretching it when I say that the TVAAS data base is an educational treasure, totally unprecedented in educational research. Consider the fact that we are talking about a state wide population of students and teachers--not a sample. We want it to be used.

Question: You talked about the question I sent, but did not quite answer it, however. Remember I am not a statistician, so it may be difficult for you to answer what I am asking, but it is the question of how you know that your method of analysis really does isolate and "attribute educational effects." For example, suppose that second grade teachers teach material in such a way that second graders seem to make normal gains, but that in actuality the second grade teacher is teaching "surface tricks" or algorithms that make it difficult for a child, in third or fourth grade, to learn the material there because they were not given an important kind of understanding then. Would TVAAS likely point to second grade as the problem? Also, I can see that TVAAS can give EVIDENCE for problem educational areas; do you claim it is CONCLUSIVE evidence or just an indicator that some specific investigation may need to be done in what seems to be a problem place? And, from what you said, it makes it seem (correct me if I am wrong) that teachers are not held accountable for making the kinds of gains with culturally deprived kids that teachers would make in culturally advantaged areas -- unless some teachers in culturally deprived, similar areas also made much better gains. Did I understand this correctly? What I am getting at here is how the assessment is used with regard to evaluating rewarding/ punishing/weeding out teachers or improving teaching methods, etc.

On re-reading your post about TVAAS I noticed one particular sentence I want to ask you about, the one that says teachers, schools and systems who do best under TVAAS are those that address the needs of all their students. This is the kind of thing I was asking about --independent verification that the people TVAAS says are best ARE best. What led you to know that this correlation you mention exists? And would not that independent kind of verification work in place of, or better than, TVAAS? What I am getting at is how do you know that the teachers your methods say are doing the best ARE the teachers who are doing the best?

Further question (Gene Glass): I gather that TVAAS is a means of measuring what it is that a particular teacher contributes to the basic skills learning of a class of students. Let me stipulate for the moment that for your sake all of the purely statistical considerations attendant to partialling out previous contributions of other teachers' "additions of value" to this year's teachers' addition of value have been resolved perfectly--above reproach; no statistician who understands mixed models, covariance adjustment and the like would question them. Let's just pretend that this is true.

Now imagine--and it should be no strain on one's imagination to do so-- that we have Teacher A and Teacher B and each has had the pretest (Sept) achievement status of their students impeccably measured. But A has a class with average IQ of 115 and B has a class of average IQ 90. Let's suppose that A and B teach to the very limit of their abilities all year long and that in the eyes of God, they are equally talented teachers. We would surely expect that A's students will achieve much more on the posttest (June) than B's. Anyone would assume so; indeed, we would be shocked if it were not so.

So the question is: How does your system of measuring and adjusting and assigning numbers to teachers take these circumstances into account so that A and B emerge with equal "added value" ratings?

Answer: > You talked about the question I have, but did not quite answer it, >however. Remember I am not a statistician, so it may be difficult for you to >answer what I am asking, but it is the question of how you know that your >method of analysis really does isolate and "attribute educational effects." >For example, suppose that second grade teachers teach material in such a way >that second graders seem to make normal gains, but that in actuality the >second grade teacher is teaching "surface tricks" or algorithms that make it >difficult for a child, in third or fourth grade, to learn the material there >because they were not given an important kind of understanding then. Would >TVAAS likely point to second grade as the problem?

More than likely. Although it is true that there are algorithms that improve test taking skills to some degree, large gains over all subject areas attributable to such "tricks" have not been documented. As to whether we can isolate teacher effects, the answer is yes. I am not a statistician, either. I am a theoretician who interprets TVAAS to non-statisticians. Therefore, I will tell you what this model does without telling you how. If you want to know how, re-ask and I will restate.

TVAAS, in determining teacher effects, the first reports of which will be issued in 1995, uses the gains of at least three years of a teacher's students as data. This information is entered in a covariance structure that includes the performance of these students under their previous and subsequent teachers, as well. Since we can follow students over time, we can examine whether one teacher is injecting an artificial high. If a teacher's students make tremendous gains above those achieved under their previous teachers and then "fall off the table" the year after, we may suspect that something bogus is happening. The software is engineered to call such data to our attention. You have to understand that we are talking about whole classes of students for three or more consecutive years and that, usually, these students are disbursed to many subsequent teachers, not just one who is offering substandard instruction. At this point you may be getting your first glimpse at how revolutionary TVAAS is. These data are computed simultaneously for all students in grades two through eight in Tennessee, along with the covariant matrices for teacher, school, and system effects. As long as a child takes TCAP, we can follow him or her from school to school, system to system, and teacher to teacher.

> Also, I can see that TVAAS can give EVIDENCE for problem educational areas; > do you claim it is CONCLUSIVE evidence or just an indicator that some >specific investigation may need to be done in what seems to be a problem >place?

Excellent question, and the answer is that TVAAS supplies indicators that something is going well or ill in specific areas, at least in so far as certain subject-specific achievement in specific grades. TVAAS furnishes this information to schools, systems, and, next year, to teachers for grades 3-8 in science, social studies, math, language arts, and reading.

> And, from what you said, it makes it seem (correct me if I am wrong) that > teachers are not held accountable for making the kinds of gains with > culturally deprived kids that teachers would make in culturally advantaged > areas -- unless some teachers in culturally deprived, similar areas also > made much better gains. Did I understand this correctly?

Absolutely not. As I told you, where a kid starts does not influence the expected gain for that child. Deal is, all those schools in poorer areas where the students enter disadvantaged can look good for once if they achieve normal--not to mention extraordinary--gains. This is really important. Some of our inner city schools are, for the first time, able to demonstrate that they are bringing their kids along much faster than our suburban showcases. In the past, because scale scores labelled these kids "substandard," no matter how much progress they made, they always compared poorly to kids from enriched environments who started out miles ahead of them. Now, that's not an educational effect. That's environmental. If we're trying to see what schools are doing, we have to look at growth. Some of our higher scoring schools and systems are finding out that, unless their students make gains, even if they scale well above the norm, they don't do well under TVAAS because they're failing their kids by letting them coast. Our best systems achieve appropriate gains even in their very top scorers. TVAAS was developed to insure that EVERY child, regardless of ability, achieved academic gains normal for their peer group.

> What I am getting at here is how the assessment is used with regard to >evaluating rewarding/punishing/weeding out teachers or improving teaching >methods, etc.

Hmmm. Well, there's a complex question. In Tennessee, there are state and local evaluation models for performance evaluation of teachers. Those involve a combination of principals, supervisors, and state evaluators, depending on which cycle it is and what the teacher requests. We also have a Career Ladder which is a series of performance evaluations, dialogs augmented with "evidence" which is something akin to a portfolio, professional development initiatives, and leadership surveys as well as principal surveys. The EIA (Education Improvement Act) states that TVAAS cannot be the sole reason for dismissal of a teacher, and, while school and system reports are public record, teacher reports are not and are provided only to the teacher and "appropriate administrators," under the law. So far, and for as far as we can see, TVAAS is not part of a teacher's formal evaluation. It is, instead, a tool for self- and program evaluation.

You previously asked how we knew whether the teachers we were evaluating as best really were the best. Well, once again, the first teacher reports don't come out until next year. However, in pilot studies done in the early eighties, when principals were asked to predict whether their teachers would be in the top, middle or bottom third as assessed by the prototype of TVAAS, they were abe to predict the bottom third of teachers in all subjects; there was good discrimination between the top and average group of math teachers; but they had no clue as to who their top and average language arts teachers were. That's very anecdotal, to be sure. Interesting, but anecdotal. The reason I mention it is to say that TVAAS is an indicator. Professional knowledge, performance assessment, TVAAS--all of these are ways of knowing, and although they are correlated, I contend that their emphases are different. Performance assessment concentrates upon process. Professional knowledge of a teacher by an administrator may also be process of a different sort. TVAAS is product oriented. We look at whether the child learns--not at everything s/he learns, but at a portion that is assessed along the articulated curriculum, a portion each parent is entitled to expect an adequately instructed child will learn in the course of a year. Of course a child will learn more, and we can use that information in TVAAS, so long as it furnishes linear metrics with appropriate statistical properties (I've said that before). Right now, we've merged the data for all the kids that have taken ACT and/or PLAN in the last 3 years and are about to enter data on the first year of state wide writing assessment in grades 4, 8, and 10. We're also in the process of developing subject-specific high school tests which have to be on-line by 1999. We're going to be examining the relationships among all of these data to determine whether some offer information not available from others, what the intercorrelations might be, and, of course, we'll be on the look out for the unexpected, too. Wanna help?

Response to this answer:

I did want to comment quite favorably on one of the things you said TVAAS does --pointing out schools and districts that are "coasting" on what I would call culturally advantaged students, that is, those districts who have "good" students to work with because of their home environments, etc. and who don't add as much to their education as they could, or develop the potential those students have. I am in an area where three of the perennially top four school districts (in the state) based on SAT etc. scores do not, I believe, do much to develop their students potentials or cultural advantage nearly as much as they should and could. I have argued for nearly twenty years that the school district that finishes in one of the top five each year, but has a much "lower" socio-economic mix of students with much less culturally advantaged background is probably the best school district of the bunch because it does more with the students it gets. So I was really excited to see that is something you seek as well.

Rick Garlikov (dems042@uabdpo.dpo.uab.edu)

=========================================================================

Date: Mon, 12 Dec 1994 13:11:16 CST
Question from Sherman Dorn:

First, I want to divide TVAAS as a research tool from TVAAS as an evaluation tool. There is no doubt that a database with several million records is an invaluable resource. In addition, I can't see anything inherently wrong with the mixed-model methodology as described, and quite a bit good. As a tool of policy, I have some additional qualms about the effects of a high-stakes evaluation instrument as the data and law currently exists. But for now I'm only going to talk about TVAAS as research and statistics. Besides, I hold the state legislature responsible for policy. So, my thoughts:

1) I think there are several good reasons to enter the prior year's scale score as a covariate with the current year's score as the dependent (okay, the matrix of scores, with the matrix of scores from the prior years as covariates). First, as Gene Glass mentioned on EDPOLYAN's mailing list, a good bit of research suggests that kids who do well on an omnibus test like the CTBS would expect to have higher gains. Second, the cross-sectional nature of norming on the CTBS, at least for the data as it currently is in TVAAS, makes gains in scale scores ambiguous in meaning: (a) gains between grades in 1989 may conflate gains one might expect with differences in cohort experiences; (b) the norming in 1989 did not represent what one would expect from truly longitudinal cohorts -- due to differences in populations, as well as grade retention, for example, the grade 3 norming population is not the grade 2 norming population aged a year. The two are incomparable. Both of these issues change the scale scores in a linear fashion, so entering the prior years' scores as covariates solves the problem. (Solving Gene Glass' conundrum means that one assumes a linear relationship between first set of scores and second set of scores, but that's much more tenable than assuming an expected gain that's constant across the distribution of first sets of scores.)

2) There are a few potential problems I can identify with the procedure TVAAS uses to include information from kids who are not completely observed (e.g., not taking the language arts test in fourth grade). From what I gather -- and correct me if I'm wrong -- the gain score in an unobserved domain is imputed from a combination of information about the location of the child (system, school, and teacher) and information taken from the child's scores in observed areas (i.e., the deviation from the gain one would expect for a child in that particular system, school, and teacher), and then the score is given a lower weighting in the algorithm. First, I assume that this is multiple rather than single imputation. Beyond that, it appears that you're assuming that observation of a single score is random (and thus it is reasonable to assume that a child's deviation on an unobserved test score would be the same as on the child's observed test scores). Have you tried other models -- specifically, that school officials may try to steer kids into being absent on days when the test they're weakest on is being conducted? For kids in special education, that bias is almost certainly true and is explicit in the ability of school officials to exclude kids with disabilities from tests. Finally, I'm concerned about the implications of weighting kids' scores differently --this might end up making kids who school officials expect to do well being more important in the TVAAS system, and thus giving an incentive for teachers to teach to kids who are already doing well. If the weights are something like .95 versus 1.00, that's pretty minor, but more severe weighting might give some perverse incentives.

3) I would hope that the TVAAS staff would urge the state legislature to revise the state code regarding program and personnel evaluation to eliminate the exclusion of students with disabilities from the core of the system. I've already heard reports of this creating some perverse incentives for schools to ignore the needs of students with disabilities, and it's part of why I asked about retained students. TVAAS staff are not responsible for state policy, but they can go a long way toward convincing legislators of the need to revise the statute so individuals with disabilities are not ignored in program and personnel evaluation.

The first two items are really the gist of the matter, and I'm curious as to your thoughts about them.

******************************************************

Answer from TVAAS (Tennessee Value-added Assessment System)

I don't have time to go into all your points right now, but I would like to point out, as I did in my last message, that special ed kids are included in the assessment of schools and systems. We could easily include them in teacher assessment, as well, but are precluded by law from doing so.

A point that is very salient here: all of our research has shown that Dr. Glass, if he indeed attributes higher gains to higher scorers, is mistaken. Gain is not predicted by achievement level. Low scoring students are as likely as high scoring ones to achieve normal gains, above-normal gains, and below-normal gains. Eliminating potential low scorers is ineffective in manipulating gain scores for cohorts of students. That's one of the good things about TVAAS--your low scorers must be attended to AND ALSO you must provide challenging materials for your high scoring students.

Please remember, too, that we do not merely compare one year's scores with another. The scores kids make are modelled into a description of what learning is normal for each child over time. Although we don't model the actual learning curve for individuals, it is a good metaphor for what happens deep within the calculations. If we have, say, ninety kids who drop from their normal learning curve in various years but under the same teacher or, conversely, if they rise under a certain teacher and gain from there the next year, we know something about that teacher. (If they rise and then show very low gains the next year, we know something else about the teacher in question.) As you know, in Tennessee schools, it is highly unusual, except in very small schools, for students to remain together from grade to grade--they tend to be dispursed among teachers. Therefore, within the model, we must also take into consideration the interaction among students among teachers among schools among system over time and calculated simultaneously.

Further, I must question how much "strategic cheating" goes on, especially to the point of asking students to stay home from the test. We are able to spot such wholesale manipulation of the testing situation as well as many other forms of cheating, should they occur. The law specifies immediate and harsh penalties for those engaging in such practices. If anything, the EIA has cut down on such practices, and it is very rare that such anomolies occur according to our best knowledge and that of the SDOE's Department of Accountability.

After you read this stuff, would you please reask the questions I didn't get to? I'll be right here.

**************************************************

Sherman Dorn:
If the data on Tennessee kids shows equivalent expected gains for low-performing and higher-performing kids, that would be very interesting -- and important -- research. (I still prefer using prior scores as covariates, because of the non-equivalence of norming populations in different grades -- unless it makes no difference in the order of estimated effects.)

That finding may not hold, however, for kids in special education. Because their participation in TCAPs is at local discretion, you may be getting a selection effect. (You may not be getting a selection bias; you just don't know without 100% participation.) If some system(s) mandates participation for kids labeled LD, MR, or SED, that would be a possible way of examining the issue (assuming the system has some way of accommodating their needs appropriately and that accommodation is held fairly constant across the years for a specific child). (LD=learning disabled, MR=mental retardation, SED=severe emotional disorder.)

Besides, my concern is less with conscious manipulation than the fact that high-stakes testing can feed people's prejudices and lead to perverse incentives regarding students with disabilities. Because of the local decisions about testing participation for individuals with disabilities, teachers and building principals may write off some kids in *hopes* of improving test scores for others. (The hope doesn't even have to be rational.) Assistant Commissioner Cannon acknowledged this week in a public forum that he's heard stories of this type from across the state; I've heard the same thing locally from special ed teachers. It's very disconcerting when in a group of kids you know, the vast majority were missing at least one TCAP score in reading or math -- when this test is the major instrument of personnel evaluation. Without their mandatory participation in *some* form of assessment, their needs will come after those of nondisabled students; it's just the incentives their exclusion (either mandatory or discretionary) creates.

But, then again, this issue of incentives is a policy matter.

*************************************

TVAAS:
Whether special ed student are required to take the test is something over which we have no control. As I said before, their gains are, by law, a factor in the determination of school and school system effectiveness, so it is certainly in the best interest of principals and superintendents to meet the needs of these students.

I can do nothing whatever about superstition and wrong-headedness of individuals. I can only hope, and with good reason, I think, that systems such as TVAAS will provide an incentive for real improvement for all children. That is the basis upon which it was created.

Mr. Cannon may have "heard of" such instances. We have all "heard" of a lot of things, some of which have basis in reality and some of which don't. Document such instances and they will be examined by the Department of School Accountability. We want such things stopped as much as you do.

As for finding gains unrelated to level of achievement, yes, that is an interesting finding, isn't it. And our n's are huge. We invite and encourage use of the TVAAS database for educational research. Perhaps you have a point you would like to examine?

************************************************

(The following message also came to me off-list from someone else, and it relates to this matter. I have asked TVAAS for clarification, since I am confused about what the law or policy is regarding special ed students' taking, or being exempted from taking, the test that TVAAS uses to make its evaluations. Rick Garlikov)

I read your posting about TVAAS with great interest and have followed up to get some information about students with disabilities in the system. I learned the following.

The Reform Act provided that *all* students with disabilities were to be exempted from testing in the TVAAS. That has been changed to allow local option as to whether or not they are included. There are no consistent criteria applied to make that decision - it is up to the local school and the assessment team. (The TN special education program includes the gifted and they are not exempted from testing under the Act).

If students with disabilities are included in the testing, there will be 2 different reports issued - one with their scores included and one without.

I understand the TVAAS is causing a lot of consternation especially among teachers. As I understand it, the first year of the system, the *Report Card* was issued only for the school systems and in year two, it included data for each school in each system. This is now year three and the data will be specific to each teacher in the system as well at the end of this year.

I look forward to finding out more about this topic in your future postings. ***************************************************

Rick Garlikov (dems042@uabdpo.dpo.uab.edu) ========================================================================= ========================================================================= Date: Tue, 13 Dec 1994 06:49:48 -0600 From: SHERMAN DORN

I can clarify somewhat the situation regarding students with disabilities. The 1992 law establishing TVAAS says nothing about students with disabilities except in the section on "estimates of teacher effects" -- i.e., reports on individual teachers. *There* the legislature excluded the scores explicitly from the calculation of teacher effects.

According to the Tennessee Value-Added and Research Center, they inclu

de any tests from individuals with disabilities in the calculation of school and school system effects. That is, if a person took a test. However, it is up to the discretion of local officials as to whether individual students with disabilities will take the battery of standardized tests that are the basis of the Tennessee Value-Added Assessment System. Tennessee only has a limited number of accommodations possible in the administration of the TCAPs. In part because of those limited accommodations, in part because the TCAPs are a VERY LONG battery of tests (schools only administer one part per day), in part because of real or imagined fears on the part of local officials, many *many* students with disabilities either do not take the tests at all or only take a few sections. (I could call the Division of Accountability to ask about the actual numbers, though it may take a while for me to get around to it during the workday.)

In addition, as I wrote to the Tennessee Value-Added and Research Center, the way they explain their imputation procedure it looks as though they assume all missing scores are missing at random, and then they weight the missing scores less than observed scores. I think that a random assumption is poor for students with disabilities, and that (depending on the weighting factors) lower weights for the missing scores *still* gives incentives for local officials to exclude individuals from some tests.

(By the way, there is a group which does research precisely on the inclusion of students with disabilities in assessment systems. It's the National Center for Educational Outcomes, located in the Dept of Special Education at the University of Minnesota. They have a slew of reports available through ERIC.)

Sherman Dorn =========================================================================

Date: Tue, 13 Dec 1994 20:08:35 CST From: Rick Garlikov *************************************************************************

Regarding TVAAS and Students with disabilities: From Sherman Dorn:

According to the Tennessee Value-Added and Research Center, they include any tests from individuals with disabilities in the calculation of school and school system effects. That is, if a person took a test. However, it is up to the discretion of local officials as to whether individual students with disabilities will take the battery of standardized tests that are the basis of the Tennessee Value-Added Assessment System. Tennessee only has a limited number of accommodations possible in the administration of the TCAPs. In part because of those limited accommodations, in part because the TCAPs are a VERY LONG battery of tests (schools only administer one part per day), in part because of real or imagined fears on the part of local officials, many *many* students with disabilities either do not take the tests at all or only take a few sections. (I could call the Division of Accountability to ask about the actual numbers, though it may take a while for me to get around to it during the workday.)

TVAAS:

As for who gets tested among special ed students, I really don't know, and I'm going to have to seek that information from State Department sources. As I've said before, the EIA says that special ed students are excluded from the TVAAS assessment of individual teachers, but their scores are included in the assessment of schools and school systems. I checked with John Schneider who works with the raw data, and he says that the data we receive distinguishes special ed students only by the time they spend in special ed classes per week. These students fall into one of four groups depending upon the amount of time they spend in special ed., but we don't use that data because whether they are included or not is a question of "any" or "none," not "to what degree." Therefore, gifted are included or excluded along with the rest of special ed kids since they are not identified as gifted on the data collection sheets; and all special ed, regardless of the hours are treated in the same manner. =========================================================================

Date: Thu, 15 Dec 1994 19:32:19 CST From: Rick Garlikov Subject: Re: TVAAS #3

From TVAAS:

I need to answer a couple of points I haven't addressed in the replies below.

On Tue, 13 Dec 1994, Rick Garlikov wrote: > ************************************************************************* > Regarding TVAAS and Students with disabilities: > >From Sherman Dorn: > I can clarify somewhat the situation regarding students with disabilities. The > 1992 law establishing TVAAS says nothing about students with > disabilities except in the section on "estimates of teacher effects" -- > i.e., reports on individual teachers. *There* the legislature excluded the > scores explicitly from the calculation of teacher effects. >

> According to the Tennessee Value-Added and Research Center, they > include any tests from individuals with disabilities in the calculation of > school and school system effects. That is, if a person took a test. > However, it is up to the discretion of local officials as to whether > individual students with disabilities will take the battery of > standardized tests that are the basis of the Tennessee Value-Added > Assessment System. Tennessee only has a limited number of > accommodations possible in the administration of the TCAPs. In part > because of those limited accommodations, in part because the TCAPs are > a VERY LONG battery of tests (schools only administer one part per > day), in part because of real or imagined fears on the part of > local officials, many *many* students with disabilities either do > not take the tests at all or only take a few sections. (I could > call the Division of Accountability to ask about the actual numbers, > though it may take a while for me to get around to it during the > workday.)

I spoke to State Testing today about the administration of TCAP to special ed students. They said that whether a special ed student is required to take the test is not based on the decision of "officials," as such. In other words, special ed testing is not decided on a group basis. Instead, whether or not a special ed student is tested is a decision based on the recommendations of that student's M-team, a group composed of the student, his or her parent or guardian, the student's teachers (some or all), guidance and/or social/psychological support personnel, and, sometimes, an administrator. This team also decides whether any modifications in the testing need to be implemented such as individual administration, provision of large-type editions, provision of a reader, etc.

> > In addition, as I wrote to the Tennessee Value-Added and Research > Center, the way they explain their imputation procedure it looks as > though they assume all missing scores are missing at random, and then > they weight the missing scores less than observed scores. > I think that a random assumption is poor for students with > disabilities, and that (depending on the weighting factors) lower > weights for the missing scores *still* gives incentives for > local officials to exclude individuals from some tests.

Not at all. If a student is missing all scores, that student doesn't appear in the computations. If we have some data on a student and he or she does not take the test for some reason in some year, we do not assume a "generic" score. The missing data is specific to that child and is a projection based upon past performance. Furthermore, as subsequent data are collected, the "missing" score is modified to be a better estimation of what *that* student would have scored and all computations incorporating that score are also modified to reflect the new estimate.

=========================================================================

Date: Fri, 16 Dec 1994 06:56:25 -0600 From: SHERMAN DORN

Rick Garlikov asks:

>Why does TVAAS use (or bother to make) projected scores at all? Why not >just say something like "With 91% of the students taking the exam, the >scores are ....." What is the purpose of projected scores? And don't >projected scores defeat the purpose of taking the exam?

There are two general reasons why one would impute scores for any statistical purpose. (Impute means to plug in numbers for the missing values.) One is technical: you are underestimating the variance of a parameter if you only rely on complete records. The other is common-sense: using only complete-data records does not give the whole picture. You may be getting a biased picture by only including records of individuals with all tests.

As I understand it, the proper way to do imputation is to decide on a few a priori models of censoring (or the pattern in which data is missing) and then test each of those models through a process called multiple imputation. I'm going to ignore multiple imputation and only talk about the models.

Some models of missing values are essentially random: in the case of test scores, you may reasonably assume that some kids are going to be sick on the day of the test, and that such illnesses will be distributed randomly. On the other hand, some patterns are *not* random, and that is what I believe to be the case with individuals with disabilities. I am fairly sure that, for students in special education, they are much less likely to take a subject test (even with accommodations) if they are performing very low in that subject.

Now, what are the models that TVAAS uses? According to the article by Sanders and Horn in the Journal of Personnel Evaluation in Education (1994), the model for all student scores is partitioned into school system effect, school effect, teacher effect, and child deviation from the other effects (and maybe an error term; I'm at home while I type this). As I read the article, the imputation model plugs in a score for a particular subject that is a combination of school system effect, school effect, and teacher effect for *that* subject and child deviation *on other subjects and other years* (i.e., observable data). That seems to me to be a random model of censoring; it's fine if the child's deviation is randomly distributed across subject areas; it is also fine if missing values are randomly distributed (and thus you wouldn't expect the deviation for any missing value to be greater or worse than deviations for observed values).

However, if my supposition is right, it is a poor model of censoring for individuals with disabilities, because the child's deviation is likely to be nonrandomly distributed, and you should expect much lower values for deviations in the missing scores. As I understand it, the TVAAS imputation model could be overestimating the scores for individuals with disabilities.

Sherman Dorn

========================================================================= =========================================================================

Date: Fri, 16 Dec 1994 07:41:22 -0600 From: SHERMAN DORN

The Tennessee Value-Added and Research Center writes:

>I spoke to State Testing today about the administration of TCAP to >special ed students. They said that whether a special ed student is >required to take the test is not based on the decision of "officials," as >such. In other words, special ed testing is not decided on a group >basis. Instead, whether or not a special ed student is tested is a >decision based on the recommendations of that student's M-team, ...

I stand corrected. I suspect that in some cases, while an individualized education plan (or IEP) decided by an M-team may be the document that *should* govern the taking of tests, parents and guardians may be pressured to agree to certain testing omissions. (IEP meetings are well-known in special education to be frequently concerned with compliance with federal law and less with good programming.) Would it be possible for the Tennessee Value-Added and Research Center to flag schools or school systems where the pattern of test score omissions seems particularly out of line (and this needs to be done by hours of service)? That would be a valuable service.

>> I think that a random assumption is poor for students with >> disabilities, and that (depending on the weighting factors) lower >> weights for the missing scores *still* gives incentives for >> local officials to exclude individuals from some tests.

>Not at all. If a student is missing all scores, that student >doesn't appear in the computations. If we have some data on a student >and he or she does not take the test for some reason in some year, we do >not assume a "generic" score. The missing data is specific to that child >and is a projection based upon past performance.

As I describe in my response to Rick, the notion of random imputation is not isolated to plugging in the mean value. Technically, I believe that TVAAS uses an "ignorable response model," which I think is inappropriate for individuals with disabilities.

=========================================================================

Date: Fri, 16 Dec 1994 17:16:29 EST From: Anne Louise Pemberton

Sherman, I find the problem with the disabled far less disturbing than you fear AS LONG AS the assigned scores have some basis on a/some former scores for the individual, and that efforts are made to arrange for "make-up tests" for missed tests if/when an avoidance pattern emerges on an individual student. ...

What concerns me is the application of the data is "rating" schools and especially individual teachers. If the tests are only administered every 3 years rather than annually, you are measuring ONLY the aggregate of the 3 teachers. I'm uncomfortable with the idea that you can magically/mathematically deduce exactly what each of the 3 accomplished individually. Perhaps I would feel more comfortable if there was additional data input from the teachers themselves, and others who physically observe the daily goings on. Things like noisy mechanical equipment, insufficient textbooks or supplies, room too hot/cold, dark/bright, etc. etc likewise limit the options a teacher has in maximizing her/his effort. Are these factors rated and reported? (Factoring out is still working with "unproven" factoring).

I'm not a statistician, but wonder if we have instruments and measurements for physical plants at the individual classroom level. Is it possible to rate an instructional facility? Is TVAAS doing it now? Should they be? Would a line on the report say, for instance: Mrs. Sunshine, Rm 323, Tiny Elementary.1993 Student gain .79 grade/age level, in a 1.38 physical facility.

=========================================================================

Date: Fri, 16 Dec 1994 16:58:58 CST From: Rick Garlikov

Regarding the question of why have projections about missing test scores:

TVAAS: That's a very important question, so I asked Bill to frame an answer. I am forwarding it to you, having added only the meanings of BLUE and BLUP for the benefit of the non-statisticians.

---------- Forwarded message ----------

One of the major advantages of the mixed model process is that it solves the 'fractured record' problem.

All students do not have complete test records for various reasons: illness, absenteeism, movement from the state etc., etc.. To complete longitudnal analyses, only those students with complete records would be included if traditional multivariate statistical processes were deployed. However, with the mixed model process all student records are included, yet still provide BLUE* of the fixed effects and BLUP* of the random effects. To achieve this, the prediction and inclusion of the missing values of the record for each student is NOT necessary. Rather BLUE and BLUP come directly from the solution of the mixed model equations. Perhaps what has generated some confusion on this point are our teaching examples which we have presented in some of our short courses. In these examples, we asked attendees to visual a situation in which we would use all other information to predict the missing value for each student, then 'plug in' those predictions with appropriate weights to obtain a conceptual approximation to BLUE.

As to Sherman's concern of differential weighting of special ed. students, even if some systems are 'playing' games, then the effect would be observed on the estimate of means, and should have trivial effect on the estimated gains for schools, systems, and teachers.

BLUE: Best linear unbiased estimator BLUP: Best linear unbiased predictor

=========================================================================

Date: Sat, 17 Dec 1994 09:10:28 CST> From: Rick Garlikov

It seems to me that Tennessee teachers probably would be worried about at least four things regarding the Tennessee Value-added Assessment System: (1) accuracy of the assessments, (2) administrative uses of the assessments, (3) meaning of the reports issued, and (4) significance of what is tested.

With regard to (1), the point that Sherman makes about projections, and my earlier question of independent verification of the statistical procedures used, it seems to me a test is readily available: withhold the grading of a randomly selected significant sample of students who DID take the test, make the projections TVAAS thinks they can make to whatever degree of accuracy about what those test scores probably are, then grade the tests and see how close the projections are. This would be an empirical test of a statistical methodology.

Regarding (3),
I would like to see sample reports for districts, schools, individual teachers, to see what kinds of things TVAAS says -- i.e., what the reports actually look like. I am interested in the intelligibility of the reports and what sort of form they take. If you could supply examples of a report, please, or some sample passages from a report.
Rick Garlikov (dems042@uabdpo.dpo.uab.edu)

=========================================================================
=========================================================================

Date: Mon, 19 Dec 1994 17:25:09 +0000
From: Harvey Goldstein

I have come in on what I gather is the tail end of a discussion of missing data in the analysis of TVAA system data to produce estimates of school effects. Apologies therefore if the issue has been discussed already, and also because I'm from a different educational system but one where we have had quite a lot of debate about value added analysis using longitudinal data.

I have two questions. The first refers to Rick Garlikov's message of Dec 16th which I don't really understand. If you are missing a test score which is the dependent variable in the particular model being used then the traditional procedure is to omit the case, here presumably the student. This is somewhat inefficient, however, (and ignoring the issue for now about bias) and there are a range of imputation procedures (e.g. multiple imputation a la Rubin etc) which attempt to make use of whatever is available. The same can be done for the predictor variables when missing. I do NOT think, however, that this is a standard by-product of the mixed model. What is a by-product of that model is when you have what is often referred to as a repeated measures design where response measurements on some occasions are missing...then a mixed model approach, as apposed to the traditional multivariate one, allows you to ignore just the missing occasions without throwing out each individual subject. As I understand it, however, the value added model is not a repeated measures one, but an analysis of 'outcome' scores, adjusting for prior achievement...or have I missed something important.

A second question....in the UK the value added debate has been looking at problems with the sampling errors (standard errors) of value added gain scores...it turns out that these are typically so large that you cannot make any statistically significant comparisons between most of your schools...only those at opposite extremes of a ranking. Is this also the case in Tenessee? If so what do you do about it when reporting?

=========================================================================

Date: Mon, 19 Dec 1994 17:45:47 -0500
From: Greg Camilli

I think that we probably aren't on the tail end of the debate on missing values. Regarding the TVAA procedures, we may be just beginning.

I have a few questions (not related to missing values). First, it was indicated earlier that the norm-referenced items used by TCAP are somehow related to the CTBS/4. In this regard, I'm wondering if items are sampled from the CTBS, or whether new items are being written at every assessment. The latter is suggested by the phrase "fresh, non-redundant, equivalent tests." My second question is regarding the CTBS scores per se. A number of different metrics are available from CTB: is TVAA using the IRT (developmental) score scale? Finally, there is the suggestion that the CTBS/4 scores are linear measures with appropriate statistical properties. I'm wondering how this was established. Is the contention here that the CTBS scale is known to be a linear metric? (More to follow after I receive the answer to this question.)

Perhaps I could also ask if my interpretation of the missing data procedure is correct. I understand that multiple imputation does not affect the values of "effect" coefficients. The models and methods used to obtain these bypass the estimation of individual scores. However, multiple imputation increases the standard errors of the estimated coefficients -- this simply refects the notion that with less than complete data, less is known about the parameter. Thus, multiple imputation is a post hoc adjustment to estimation. However, with multiple imputation one can always produce a "complete" set of data (where imputed values have replaced missing values) to expedite reporting or secondary analyses. (Please do not hesitate to correct errors in this formulation.)

=========================================================================

Date: Mon, 19 Dec 1994 19:39:31 -0600
From: Sherman Dorn

Harvey Goldstein writes:

>[An explanation of the difference between multivariate imputation a la Rubin and repeated measures >designs.]

Thanks -- no, no one has yet mentioned this. I will admit being relatively ignorant about repeated measures designs. I suspect that when Bill Sanders weighs in, TVAAS will be more akin to repeated measures.

>A second question....in the UK the value added debate has been looking at > problems with the >sampling errors (standard errors) of value added gain scores...it turns out that > these are typically so large that you cannot make any statistically significant > comparisons between most >of your schools...only those at opposite extremes of a ranking. Is this also the > case in Tenessee? If so what do you do about it when reporting?

I don't know if the Tennessee Value-Added and Research Center noted the standard errors when talking to the press: I know that the Nashville papers did NOT report either standard errors nor took account of that when reporting scores publicly.

In the 1983 report co-authored by Sanders on a feasibility study (using a forerunner of the TVAAS model) in Knox County, the vast majority of teachers fell within two times the standard error of the median, in all three grades and all subjects tested. In that case, it truly seemed to be true that the gain score methodology only distinguished the very worst and very best teachers from the mass of teachers.

=========================================================================

Date: Mon, 19 Dec 1994 19:58:04 CST
From: Rick Garlikov

In response to Anne Pemberton who thought the TCAP testing schedule was once every three years. No, it is every year, in the spring. TVAAS incorporates three years of data for all its assessments.

=========================================================================

Date: Tue, 20 Dec 1994 07:04:59 -0600
From: Sherman Dorn

Please accept my apologies for the prior post: I was unaware that the mailer would turn the attachment into gibberish.

The following are my thoughts about TVAAS and accountability after reading the articles and manuscripts available about it and after a lengthy exchange with a staff member of the Tennessee Value-Added and Research Center. The following are largely comments about the use of TVAAS for policy purposes (i.e., the evaluation of school personnel and programs). They are not comments about the research utility of the TVAAS database.

My concerns about TVAAS-as-evaluation focus around the following five issues:

The state legislature has confused the technocratic tool of TVAAS-as-evaluation with the fundamentally political task of program evaluation.
The state legislature locks program evaluation into a set of centrally-calculated, rather than grassroots-developed, statistics.
The state legislature locks educational program evaluation into summative, rather than formative, evaluation.
The creation of the TVAAS locks the state into a testing program (the Comprehensive Test of Basic Skills) that gives little useful feedback to teachers.
The state legislature has discriminated against individuals with disabilities in the exclusion of students eligible for special education from the TVAAS estimates of teacher effects.

The following will discuss each issue in turn.

The creation of TVAAS at the heart of educational evaluation demonstrates the legislature's confusion about the fundamentals of evaluation.
When the Tennessee legislature enacted TVAAS, it tried to put a statistical model (value added assessment) at the heart of evaluation. Essentially, it instructs the state board of education to gauge whether schools meet the "required rate of progress" on TVAAS in order to be in compliance with state policies (i.e., that schools should have a "mean gain for each measurable academic subject within each grade greater than or equal to the gain of the national norms" [Tennessee Code Annotated $49-1-601 (b) ]). If schools don't meet this rate of progress, the commissioner of education can place school systems on probation and can, if the rate of progress doesn't come up to speed, remove local board members and the local superintendent.
Similarly, the legislature allows TVAAS estimates of teacher effects to be used for formal personnel evaluation after three years of data exist.
Now, while I think that statistics are an important part of policymaking when used correctly, I do *not* think that it is intelligent to put a single mechanism at the heart of such decisions. Fundamentally, decisions about competence in administration (and teaching) are political. I suspect that any such decisions in Tennessee will eventually become a matter of political judgment, but even in that case educators will then see TVAAS through cynical eyes: "They said it was the numbers that are all, but look at who they decided to remove." It is much better to avoid policies that can easily be seen as hypocritical. We have enough of that in education as it is. If we are to have evaluation of school systems and teachers, it must be the *educators* who see the evaluation as legitimate. Will TVAAS encourage that? I'm not so sure; I have yet to see evidence of teachers' seeing TVAAS as acceptable in Nashville area schools. One principal told me that TVAAS made her school look good, but she didn't think it was a legitimate tool of evaluation.
And I even have some doubts about whether the judgments will be judgments and not just rote decisions based on cut-offs. Case-in-point: the state established a few hundred thousand dollars which the state board was supposed to distribute this year to schools who met certain goals, one of which included a 10% "dropout rate" or less. One of the schools it gave an award to was Hume-Fogg Magnet School in Nashville, which had an amazingly low dropout rate. Guess what? Magnet schools don't have dropouts; they send problem students back to their home schools. So setting up the criterion led to an unfair advantage for magnet schools, but the board decided it still had to follow the guidelines because they had established the goals explicitly. This may happen with TVAAS and probation decisions as well.
In short, setting a statistical mechanism at the heart of evaluation ignores the basic political fundament of evaluation and policy decisions. In the long run, it is not wise.
TVAAS as established by the state locks Tennessee into centrally-created, rather than grassroots-developed, statistics.
Essentially, the Tennessee Value-Added and Research Center does the statistical analysis and presents the information to the state by itself. With some added "help" by the media. (The Nashville Tennessean this year, by the way, printed two versions of the rank-ordering and scores of schools this fall. They had to print it a second time because they screwed up a lot of the scores the first time. Imagine what that does to teachers reading the paper on a Saturday, and then reading their school's scores the next day as different.) These numbers are not derivable by educators themselves and -- more to the point -- I have yet to meet a teacher which understands the information presented to him or her. This may be because I have a limited sample, or maybe it's biased towards teachers who don't have the proper information, or maybe teachers sense my skepticism and wouldn't speak to me if they really like TVAAS, but I suspect that it's because the numbers are created far away from them. They're not tangible, in the form that a test graded by them, or in the form of answers correct, would be. Again, this erodes the legitimacy of evaluation in the eyes of teachers -- and it's a problem with all centrally-administered tests.
In addition, the central creation of statistics also erodes the possibility of grassroots political activism to reform schools. If the core of Tennessee's reform system was -- to create a thought experiment -- the mandatory publishing by each local system of certain facts, and manuals distributed to local organizations and citizens of how to calculate their own statistics on a home computer, how might educational reform be different? The voters might *vote out of office* board members who didn't move to change things.
In reality, statistics are produced both by central administrations and locally. There should be a healthy mix, however, and the weighting of things towards central administration in TVAAS is unhealthy.
Let me give a contrast between Chicago school reform and the removal of school officials in East St. Louis. In Chicago, a broad alliance of people, from businesses to civil rights activists, were fed up with schools. They used a mix of statistical sources -- from standardized test scores to a report on high school dropouts created by Aspira, Inc. (a Hispanic civil rights organization) to demand first the ouster of the superintendent and then the wholesale reform of schooling in Chicago. The state legislature made the *explicitly political* decision to reform and restructure Chicago's public schools through a statute. The result (as far as the legitimacy of the reforms is concerned): the primary opponents of the reform were the school principals and administrators, many of whom lost jobs (and for the principals, lifetime tenure) through the reforms.
In East St. Louis, the state department of education decided to remove school officials in this largely-African American city for incompetence. In contrast with Chicago, the move was opposed by LOTS of different groups locally, with a commonly-voiced suspicion that some racial politics were involved in the decision to remove officials in this particular school system. The lesson, I think, is that you need pressure from both grassroots activists and others allied with them to create successful movements for school reform in the political arena. And I don't think you can get that grassroots activism without grassroots creation of statistics. People have to *feel* they know what's going on in schools. Numbers created far from them, which they cannot replicate, aren't enough. If they only have such numbers, the debate can turn superficial very quickly, when opponents can only spout numbers at each other instead of talking about what the numbers mean. In Chicago, a local civil rights group did its own research and created its own, alternative, dropout statistics. That is a far cry from what the state legislature has done in Tennessee.
The legislature has locked Tennessee into an evaluative model that is summative rather than formative.
When a state uses annual tests as part of an accountability mechanism, it is creating a summative evaluation system, where you make a decision about a program as a whole [do we remove the superintendent], instead of formative, where you can use the information to make programmatic decisions [what do we do now?]. The problem with annual tests is that, when you get the information back, the kids are gone. The teachers can't do anything to change their teaching methods for that group of kids.
(Okay, I suppose annual tests are formative if you're only looking at the style of the individual teacher. It is *not* formative, however, if you're looking at adjustments for the benefit of specific children.)
In contrast, there *are* formative evaluation systems that let teachers make instructional decisions many times during the year. Let me contrast TVAAS (or the CTBS, on which it's based) with two systems I'm familiar with, one in math and one in reading. The following table describes how often students take the test, whether the test provides information about subskills, and how many tests are needed before you can make a teaching change inductively:
```
            Frequency   Subskills?   Minimum to change
TVAAS/TCAP  annually       no            unknown
Math        weekly         yes               4
Reading     2x week        no                8
 
```
In the case of the *truly* formative evaluation systems, you can make a teaching change every month based on substantive, real data. If you want to make a teaching change on a particular group of kids based on TCAPs, you can't. In order to change teaching methods, you either need information about subskills during the year that's comparative across time, or frequent tests, or (preferably) both.
And that leads me to ...
The state legislature has locked the state into a testing system with little useful feedback.
Put simply, the state currently uses the Comprehensive Test of Basic Skills as its standardized test (or part of it, the part that's used for TVAAS). The only breakdown you can get from the test is by subject area: scores in math computation, math application, reading, social sciences, science (and maybe language arts and writing, I forget exactly). But if you're a teacher and wondering if your students aren't picking up division, you're out of luck.
And I suspect that, for a variety of reasons, such tests (without any item information coming back to teachers) will continue to be used in Tennessee, because the state is very keen on test security (and it interprets that broadly to include item information on feedback) and because the test used for TVAAS will always be standardized.
The current TVAAS discriminates against individuals with disabilities.
This discrimination comes in two forms: one is the explicit exclusion of individuals with disabilities from the estimation of teacher effects (and that's in the statute); the other comes from the frequent exclusion of individuals with disabilities from taking the standardized tests at all, and the implicit exclusion of individuals with severe disabilities from tests that they really can't take (and thus are excluded from any measure that would describe their progress from year to year).
First, this is a direct denial of benefits to individuals wtih disabilities. The state has decided that evaluation is a critical function of state government, as important to educational improvement as activities like building construction and textbooks. Thus, we should analyze the investment in this evaluation system in the same way we do for building construction and textbooks. Would it be acceptable for the state to create buildings that were only used by nondisabled students? Or for the state to purchase instructional materials that are only used or usable by nondisabled students? Absolutely not. That would be a clear violation of section 504 of the Rehabilitation Act of 1973.
Second, the exclusions create an incentive to overidentify students as disabled. If a teacher is worried about a student's progress, and knows that the student's progress will be part of the teacher's evaluation unless the child is certified in special education, the exclusion creates an additional incentive to refer the child rather than make one more attempt to accommodate her or his needs. (This will probably happen for children who are, in the opinion of the teacher, borderline.)
Third, the exclusions create an incentive to remove disabled students from regular classrooms. If a teacher thinks that a student is somewhat disruptive, and whose scores are unimportant anyway for TVAAS (or who probably won't take the tests), he or she may push to get the child out of the classroom "so that the other children can learn." This sort of action would clearly violate the right of disabled children to be in the "least restrictive environment" where they can be accommodated. Again, this would probably happen to children who teachers think are borderline acceptable as far as presence in the classroom is concerned.
Fourth, the exclusions create an incentive for regular education teachers to ignore the instructional needs of disabled students. If students are either unlikely to take the test or are by fiat excluded from the teacher's TVAAS responsibilities, the teacher may very well concentrate on those children "who matter." This could easily be a self-fulfilling prophecy as far as students with mild disabilities are concerned: regular teacher thinks child probably won't take the tests, teacher doesn't pay attention to child, child learns so little and whose confidence drops so much that the teachers convince the parents to exclude the child from the tests at the end of the year.
As far as these issues go, the state has at least three possible ways to eliminate this discrimination: (a) it could eliminate standardized tests and the TVAAS; (b) the gain scores of students who don't take the test could be arbitrarily set to zero; or (c) the state could invest in a parallel system of teacher supervision, as heavily funded as TVAAS, for individuals with disabilities.
Note that the elimination of the statutory exclusion of individuals with disabilities from TVAAS would *not* eliminate the discriminatory effects. First, some children will never take the standardized tests because of the severity or nature of their disabilities. Second, the exclusion of some children from testing at all retains all the perverse incentives described above.

=========================================================================

Date: Wed, 4 Jan 1995 16:21:37 CST
From: Rick Garlikov

This is *MY* response to a previous post; it is NOT a response from TVAAS. Rick Garlikov

Sherman Dorn said, then subsequently explained and argued for, the following:

"My concerns about TVAAS-as-evaluation focus around the following five issues:

"1.  The state legislature has confused the technocratic tool of
TVAAS-as-evaluation with the fundamentally political task of
program evaluation.
"2.  The state legislature locks program evaluation into a set of
centrally-calculated, rather than grassroots-developed,
statistics.
"3.  The state legislature locks educational program evaluation
into summative, rather than formative, evaluation.
"4.  The creation of the TVAAS locks the state into a testing
program (the Comprehensive Test of Basic Skills) that gives
little useful feedback to teachers.
"5.  The state legislature has discriminated against individuals
with disabilities in the exclusion of students eligible for
special education from the TVAAS estimates of teacher effects."

Regarding (1): I am not happy with the language used to express what Sherman's good examples show he means; and I will return to that after discussing what I understand him to mean. First, I don't think the state has at this point confused the issue, though there is a reasonable danger, in ways Sherman points out, that they will. TVAAS has already said in a previous post that their results are supposed to be used as INDICATORS that something is very good, or very wrong, about a school/district/teacher, not conclusive evidence. That leaves room for the kind of evidence to be given about the merit of schools/districts/teachers which IS reasonable to educators -- a concern that Sherman expressed in explaining (1). Keeping this system from degenerating into a system of rote cut-offs based on TVAAS numbers without use of judgment is crucial. But surely it should be possible to prevent such degeneration; and safeguards should be put into place to make certain it does not occur.

Sherman says "If we are to have evaluation of school systems and teachers, it must be the *educators* who see the evaluation as legitimate." I want to make certain this is not ambiguous; while educators must see evaluations as legitimate, they should not be the only ones who do. If educators are the sole arbiters of whether they are doing a good job or not, and which of them is competent or not, without having to give reasons that make sense to anyone else, I fear we would end up with the same kinds of problems we have with the AMA policing its own --which they tend not to do very much unless there is the most egregious and publicized ineptness, negligence, or malpractice. My experience with "ethics" or "licensing" boards in governmental, professional, and business organizations is that they end up with a set of guidelines they follow that are superficially ethical or professional sounding, but which, at any meaningful level, frequently have very little to do with normal notions of competence or morality. In fact, much of the public wants to know why some administrators cannot seem to identify incompetent or excellent teachers when left to do so on their own; and why some teachers are able to get degrees at all. While the criteria for competence in education certainly need to be reasonable to educators, they should also be reasonable to people who are not educators.

The reason why I am unhappy with the language of (1) is that I don't think program evaluation is fundamentally a political decision, though I think programs often are judged on political grounds (i.e., power, perhaps coupled with ideology), and perhaps even more often appear to be. That is true when evaluations are not reasonable or when they do not respond to valid reasons given in opposition to them. But normally, when people act in good faith to evaluate a program, they are trying to be reasonable, not simply asserting power. The fact that some programs are judged purely on political (power and ideological) grounds, and are therefore not really being reasonably evaluated at all, should not lead us to say that the proper way to do evaluations is politically. The proper way to do evaluations is based on reasons and judgment, not power and blind ideology.

Moreover, I thought TVAAS was established to judge how well policy goals are being met, not how worthy those policies goals are. From what has been said so far, TVAAS tries to ascertain how well certain subject content areas --as prescribed outside of TVAAS-- are being taught, not whether those subject content areas are important or whether "knowing" them in certain ways is important. It IS important that legislators, administrators, and the public understand that these other issues ARE crucial, and that TVAAS evaluations are at best no more meaningful than what they are (legislated to be) based on. The media should never report mere numbers or ratings -- though they will.

Regarding (2), I think what Sherman says is true and important here, but MORESO if instead of speaking only about the calculation of "statistics", the operating principle was about the expression of "reasons". Although statistics can give evidence, not all evidence is statistical in nature. Plus, as Sherman points out, debates in terms of "numbers" alone, rather than the reasonable meaning of those numbers, can quickly turn superficial. In Sherman's examples from the two different school districts, it seems to me the important features were whether the reasons given by a centralized authority were agreed with by the local residents, and vice verse. Without some sort of consensus, the issues do become politicized and embattled, turning on a struggle for power, rather than reasonable and peaceful resolution. I think that this way of understanding (2) links it closely with (1).

Regarding (3), I may be misunderstanding the point of TVAAS, but I thought it WAS established to give summative evaluations about teaching competence/progress, not formative evaluations about improving instruction. So that even if TVAAS cannot be used to improve instruction or pinpoint students' specific learning difficulties, that is not a point against it. It is important to identify problems or good resources, even if that identification does not pinpoint a remedy for the problems. The profession should try to provide the remedy, I would presume, not TVAAS.

Teachers are free to give tests which allow formative evaluations and instructional corrections. There is no reason the state needs to be giving these tests for teachers. The mission of TVAAS seems to have nothing to do with this aspect of teaching, so I don't see that as an actual objection against TVAAS.

Regarding (4), Sherman says,
            "the state currently uses the Comprehensive
            Test of Basic Skills as its standardized test
            (or part of it, the part that's used for
            TVAAS).  The only breakdown you can get from
            the test is by subject area:  scores in math
            computation, math application, reading,
            social sciences, science (and maybe language
            arts and writing, I forget exactly).  But if
            you're a teacher and wondering if your
            students aren't picking up division, you're
            out of luck."

But again, the purpose of TVAAS, I presume, is to evaluate how well teachers/schools/districts have done that aspect of their jobs involved in teaching content of whatever sort is measured on the Comprehensive Test of Basic Skills. And surely, shouldn't a teacher be able to figure out whether his/her students are "picking up division" or not without the state's having to do that for him/her. I take it that TVAAS is not about helping teachers/schools/districts do their jobs, but about suggesting, in some attempted objective way, to everyone how well or poorly they have done it.

=========================================================================

Date: Wed, 4 Jan 1995 20:12:39 -0600
From: Sherman Dorn

> This is *MY* response to a previous post; it is NOT a response > from TVAAS. Rick Garlikov

And here I was, worried that no one had read that post.

Regarding my concern that TVAAS will be used as a rigid evaluation tool:

>First, I don't >think the state has at this point confused the issue, though >there is a reasonable danger, in ways Sherman points out, that >they will. TVAAS has already said in a previous post that their >results are supposed to be used as INDICATORS that something is >very good, or very wrong, about a school/district/teacher, not >conclusive evidence.

With respect to the Value Added and Research Center (VARC) staff member who posted that information, I beg to disagree, based on the law. The relevant sections of Tennessee's education code (49-1-601 through 49-1-610, in the 1994 Supplement to Tennessee Code Annotated) clearly says that systems which perform subpar for several years are subject to probation. The code does mention dropout rates (undefined) as another criterion, and pulling local officials is at the discretion of the state commissioner of education, but there is no doubt what is the central force here: TVAAS statistics.

To be fair to the incoming governor, a Nashville paper reported yesterday that he was intending not to follow the provisions of the code regarding the removal of local educational officials. Of course, with the reputation local papers have for absolute accuracy, I'll still wait and see what happens. (But see what happens if the report is true: the governor would be explicitly ignoring a provision of the education code. Great.)

>Sherman says "If we are to have evaluation of school systems and >teachers, it must be the *educators* who see the evaluation as >legitimate." I want to make certain this is not ambiguous; while >educators must see evaluations as legitimate, they should not be >the only ones who do.

Agreed. But this makes it very clear that evaluation is a political task, in its underlying assumptions if nothing else.

>The reason why I am unhappy with the language of (1) is that I >don't think program evaluation is fundamentally a political >decision, though I think programs often are judged on political >grounds (i.e., power, perhaps coupled with ideology), and perhaps >even more often appear to be.

The fact that actions are political does not make them bad. You can decide, for example, to fire Ronald Jones on the basis that Jones' lowest-performing students didn't improve their ability to read at all during ten years in his class; but that has at its basis the political judgment that a teacher must pay attention to lower-performing students. This decision may anger parents of higher-performing students who loved Jones' teaching and how their children did. That's a political decision, not a technocratic one.

>Moreover, I thought TVAAS was established to judge how well >policy goals are being met, not how worthy those policies goals >are.

My contention is that TVAAS is an abdication of the legislature's need to make policy goals. It presumes that, by the installation of TVAAS, Tennessee's schools will improve. A technocratic solution.

Moreover, my concern is that TVAAS freezes the state into its current set of tests; it will be very difficult to change evaluation goals at this point. TVAAS is a policy flywheel.

Regarding summative versus formative evaluation:

>I thought it WAS established to give summative evaluations about >teaching competence/progress, not formative evaluations about >improving instruction.

Sanders and Horn's article in the Journal of Personnel Evaluation Education claims that one advantage of TVAAS is its formative potential, and its instructional-methods neutrality.

>Teachers are free to give tests which allow formative evaluations >and instructional corrections. There is no reason the state >needs to be giving these tests for teachers. The mission of >TVAAS seems to have nothing to do with this aspect of teaching, >so I don't see that as an actual objection against TVAAS.

This is a very interesting claim. I hope I am not misinterpreting, but is Rick suggesting that the state should evaluate program effectiveness to the extent of removing officials and firing teachers, but not encourage them to use formative evaluation?

Moreover, I think the high-stakes nature of TVAAS will drive other forms of evaluation from consideration, as teachers focus on the annual tests in the spring. Besides, there is NO GUARANTEE that a specific form of formative evaluation, even if it is tied to the curriculum, will mean students will perform better on the tests used in TVAAS. So it is, in fact, unfair to tell teachers, "well, you go train yourself in formative evaluation, use it to the best of your abilities, and hope that the students then do better on our annual tests. Oh, and no, the evidence you accumulate in your classroom doesn't count as much as the TVAAS."

In addition, formative evaluation is alien to most teachers, as far as guiding instruction is concerned. Getting teachers to use formative evaluation would require a very different type of support structure, and quite a bit of money. It conflicts with TVAAS simply because it would compete with the VARC for funding. (And the next governor has promised he won't raise taxes.)

Regarding the usefulness of feedback of the tests (and my claim that TVAAS freezes the state into using these tests): >But again, the purpose of TVAAS, I presume, is to evaluate how >well teachers/schools/districts have done that aspect of their >jobs involved in teaching content of whatever sort is measured on >the Comprehensive Test of Basic Skills. And surely, shouldn't a >teacher be able to figure out whether his/her students are >"picking up division" or not without the state's having to do >that for him/her.

I'll defer here to William Webster, head of the evaluation unit in the Dallas schools, who told a CREATE audience a few years ago that one of the requirements of top-down evaluation of teachers was that teachers be provided with explicit, helpful feedback.

>I take it that TVAAS is not about helping >teachers/schools/districts do their jobs, but about suggesting, >in some attempted objective way, to everyone how well or poorly >they have done it.

This is precisely what it is, and I think Tennessee law gives an unwise emphasis to one type of evaluation instrument.

=========================================================================

Date: Wed, 4 Jan 1995 20:52:42 CST
From: Rick Garlikov

Here is TVAAS's response to Sherman Dorn's first post on the policy aspect of TVAAS.

----------------------------Original message----------------------------

In response to Sherman Dorn's most recent critiques of the Tennessee Value-Added Assessment System (TVAAS), the UT Value-Added Research and Assessment Center offers these remarks as a complement to those presented by Rick Garlikov:

On Tue, 20 Dec 1994, Sherman Dorn wrote:

> Please accept my apologies for the prior post:  I was unaware that
> the mailer would turn the attachment into gibberish.
>
> The following are my thoughts about TVAAS and accountability after
> reading the articles and manuscripts available about it and
> after a lengthy exchange with a staff member of the Tennessee
> Value-Added and Research Center.  The following are largely
> comments about the use of TVAAS for policy purposes (i.e., the
> evaluation of school personnel and programs).  They are not
> comments about the research utility of the TVAAS database.
>
> My concerns about TVAAS-as-evaluation focus around the following
> five issues:
>
> 1.  The state legislature has confused the technocratic tool of
> TVAAS-as-evaluation with the fundamentally political task of
> program evaluation.

TVAAS is a statistical model for program evaluation. It is not a "technocratic tool," as Dorn so colorfully phrases it. It was not developed by the State of Tennessee but by Dr. Bill Sanders, a statistician, for the purpose of addressing problems previously encountered in using student achievement data in educational assessment. The State of Tennessee adopted it as the model for such assessment because TVAAS was able to supply valid, reliable, unbiased data based on student gains, data the State thought was important. The Education Improvement Act (EIA) allows the use of TVAAS data as a *component* of the evaluation of educational entities--teachers, schools, and school systems. It specifically states that TVAAS may not be the only source of data in such evaluations. Teachers are further protected in several ways. First, since the law mandates the use of at least three years of data for a formal evaluation (although fewer years may be used for informational purposes), only tenured teachers will receive TVAAS reports that can be used for assessment purposes. Second, the EIA specifically states that teachers may not be dismissed exclusively on the basis of TVAAS data. Third, teachers are evaluated by several different models including performance assessment by local supervisors and principals. These evaluations are mandated by the State. Fourth, teachers may elect to be evaluated intensively to achieve advanced Career Ladder status. Career Ladder evaluations include dialogs, presentation of artifacts, and repeated classroom observations.

TVAAS data is currently available only for grades three through eight in the subject areas of science, social studies, math, reading, and language arts. This means that many teachers do not have the advantage of TVAAS data. Subject-matter specific tests for grades nine through twelve are now being developed by CTB/McGraw Hill in conjunction with committees of Tennessee secondary school teachers who are devising questions in their areas of expertise. However, it is unlikely that TVAAS will ever assess ALL teachers in ALL subjects. Therefore, it will never be the ONLY means of assessing teachers or schools. TVAAS merely provides measures of student academic gains, surely a useful component of any educational assessment system.

As for Dorn's projections of what "might" happen, we see no basis in reality for such dire prophecies. The link he attempts to establish between the problems associated with fairly rewarding schools for improvements in the dropout rate and TVAAS is nonexistent.

Finally, we disagree with Dorn on the political nature of assessment. Assessment should be a scientific process rather than a political one. Although setting the criteria for assessment is a political act, assessment itself must be scientific in orientation, centered on the real meanings of the most common of assessment terms--reliability, validity, and fairness.

> 2.  The state legislature locks program evaluation into a set of
> centrally-calculated, rather than grassroots-developed, statistics.

The State of Tennessee has a vested interest in monitoring the health of one of its largest and most important endeavors--the education of its children. It has chosen TVAAS as a means of carrying out its oversight responsibilities because TVAAS fairly estimates the academic growth of students, an indicator recognized as important by teachers, parents, local administrators, community leaders, and state officials, alike. Other indicators--dropout rate, attendance, promotion rate, and graduation rate--are also used by the state for assessment purposes. As stated above, a variety of methods are used state-wide for the evaluation of teachers.

TVAAS is a sophisticated statistical process that is beyond the capacity of many to grasp mathematically. However, TVAAS data is reported in a manner that is comprehensible to any educator willing to look at the reports and graphs supplied, with explanations, specifically for the purpose of rendering the data USEFUL. Judging by reports from many schools and systems across Tennessee, TVAAS data is now extensively used for curriculum planning and development.

> 3.  The state legislature locks educational program evaluation into >
summative, rather than formative, evaluation.

For further clarification of Mr. Dorn's point, we quote him from a paragraph in which he discusses this point, further on in this same composition. Mr Dorn writes

"When a state uses annual tests as part of an accountability mechanism, it is creating a summative evaluation system, where you make a decision about a program as a whole [do we remove the superintendent], instead of formative, where you cn use the information to make programmatic decisions [what do we do now?]. The problem with annual tests is that, when you get the information back, the kids are gone. The teachers can't do anything to change their teaching methods for that group of kids. (Okay, I suppose annual tests are formative if you're only looking at the style of the individual teacher. It is *not* formative, however, if you're looking at adjustments for the benefit of specific children.)"

First, we would direct those following this topic to Rick Garlikov's excellent discussion on this point. We add only a few comments:

The State of Tennessee considers TVAAS to be a model for both summative and formative assessment. Education is an ongoing effort, and the end of a school year is simply a point in a continuum. TVAAS is used, as stated under (2) above, to assess how well students are progressing under current practices. If they are making expected gains, we assume that the teacher, school, or system is providing effective instruction. If not, then adjustments should be made. When systems are doing very poorly indeed, they are placed under supervision and are more closely monitored by the state. While under probation, they are expected to make progress toward acceptable standards--a formative process directed by many indicators, one of which is student gain.

TVAAS provides breakdowns of student gain data to schools and school systems by school, grades, subjects, and achievement levels of student (this last, upon request from the Department of School Accountability). With all this information, schools and systems can easily pinpoint problems and successes and make specific policy decisions based upon this knowledge. The data provide guidance as to "What we do now."

The State of Tennessee leaves the day-to-day adjustments in teaching strategies to the classroom teacher. In the last four years, Tennessee has conducted a concentrated effort to deregulate its schools, placing great emphasis on site-based management and school-based decision-making. In dispensing with the rules and regulations that have, in the past, served to dictate how schools may operate, Tennessee has placed the responsibility for program development, allocation of resources, the organization of schools, teaching strategies, and many other things in the hands of local administrators. Because this is so, TVAAS was adopted to monitor the educational process, not the "style," as Dorn puts it. The "style" in which education is conducted is not prescribed. The process of education can be completely individualized. TVAAS merely assesses the value added, the product of educational efforts, in the grades and subjects for which we have appropriate linear measures. As Garlikov points out, it was never designed to be otherwise. And as we have repeatedly stated, no one assessment model will suffice for all legitimate assessment purposes.

> 4.  The creation of the TVAAS locks the state into a testing program
> (the Comprehensive Test of Basic Skills) that gives little
> useful feedback to teachers.

As we have repeatedly told Dorn, TVAAS does not lock the state into any testing program whatever. TVAAS can use any assessment data that provides appropriate linear measures. In other words, TVAAS requires scalable data because it assesses progress over time.

TCAP was in place before TVAAS was ever adopted, and state-wide annual tests have been used in Tennessee, as in other states, for years. TVAAS simply utilizes data in a new way, furnishing far more useful information to teachers and schools than was ever possible in the past.

> 5.  The state legislature has discriminated against individuals
> with disabilities in the exclusion of students eligible for special
> education from the TVAAS estimates of teacher effects.

Please see Dorn's discussion of this item, below.

First, Dorn refuses to acknowledge, as we have repeatedly told him, that low-achieving students are at least as likely to achieve appropriate gains as their higher-scoring classmates (based upon three years of state-wide data), so there is no logical reason to exclude them from the tests.

Second, we find it most disturbing that Dorn attributes such unethical behavior to Tennessee teachers as to ignore students in need of their tutelage and to connive to have their parents classify them as special education and to encourage them to avoid testing.

Third, we have suggested to Dorn that if he has any real knowledge of such occurances that he refer them to the Department of School Accountability. There is a mechanism (to use Dorn's terminology) in place to insure that special education and other students are not exploited in this manner.

Fourth, Dorn is obviously unaware that the funding for TVAAS is miniscule and that supervisory personnel are already available in every system for the purpose of overseeing teacher conduct and, in particular, special education programs. Perhaps he believes that they are possessed of the same iniquities he attributes to teachers and cannot be trusted with their appointed duties.

In contrast, it is our supposition that teachers and administrators are dedicated to the same goal that the State of Tennessee espouses: the improvement of education for all students in order to maximize the potential of each of them. We sincerely hope that TVAAS is a means by which this goal can be more readily achieved.

We appreciate the opportunity to respond to these issues.

=========================================================================

Date: Wed, 4 Jan 1995 23:12:35 -0600
From: Sherman Dorn

The UT Value-Added Research and Assessment Center wrote a lengthy response to my long comments several weeks back about the Tennessee Value-Added Assessment System. (Please accept my apologies for the incorrect name of the center in other posts.) I will try to confine my remarks to central issues and matter of potentially discriminatory effects on children with disabilities.

Regarding evaluation and my claim of its political assumptions:

>        we disagree with Dorn on the political nature of
>assessment.  Assessment should be a scientific process rather than a
>political one.  Although setting the criteria for assessment is a
>political act, assessment itself must be scientific in orientation,
>centered on the real meanings of the most common of assessment
>terms--reliability, validity, and fairness.

This is the core of our disagreement. TVAAS assumes that assessment should fundamentally be a scientific process. I assume that evaluation's political nature will outweigh any claim to positivistic science, *especially* if it's asserted as such.

(Note: this does not mean that I am anti-empirical, or that VARAC staff are against using other materials for evaluation. This is an argument about the orientation of evaluation.)

Regarding the origins of statistics:

>The State of Tennessee has a vested interest in monitoring the health of
>one of its largest and most important endeavors--the education of its
>children.

The question here is NOT whether the state should produce any statistics, but whether its creation of statistics should dominate the debate. I think it's much healthier for public debate to include statistics from various sources and levels of government and often from outside government as well. I think the Educational Improvement Act puts TVAAS into an overwhelmingly dominant position.

Re: my claim of potentially discriminatory effects on individuals with disabili

>so there is no logical reason to exclude them from the tests.

Information on low-achieving students as a group is NOT information about individuals with disabilities. Moreover, my concerns deal with the incentives for flawed, stressed educators to try to slough off some of their responsibility

>        Second, we find it most disturbing that Dorn attributes such
>unethical behavior to Tennessee teachers as to ignore students in need
>of their tutelage and to connive to have their parents classify them as
>special education and to encourage them to avoid testing.

Most unethical behavior in institutional settings occur not because of conscious and evil conniving but because the circumstances and work habits combine to create incentives for unethical behavior. Besides, the existence of accountability frameworks such as TVAAS presumes the irrationality and untrustworthiness of teachers (else why would we need to hold them accountable?).

>        Third, we have suggested to Dorn that if he has any real
>knowledge of such occurances that he refer them to the Department of
>School Accountability.  There is a mechanism (to use Dorn's
>terminology) in place to insure that special education and other students
>are not exploited in this manner.

Such evidence would have to be very clear; the hearsay testimony from a non-tenure track academic about the supposed intent to pressure parents into removing their children from testing would not cut it. However, I do not have to make a hard case of specific instances in order to make a pretty good case that Tennessee educators tend to exclude kids with disabilities from testing. According to the National Center on Educational Outcomes, fewer than 50% of students with disabilities in Tennessee participated in state testing in 1992. Some states have much, much higher participation rates. My conclusion: Tennessee schools just don't try hard to include students with disabilities in testing.

>        Fourth, Dorn is obviously unaware that the funding for TVAAS
>is miniscule

Compared to the entire education budget, certainly. However, there is NO such comparable accountability system with quantifiable tracking of student outcomes for children with disabilities; when looked at in that frame, individuals with disabilities receive the short end of this sytem that the state legislature has put at the heart of educational reform in Tennessee. If quantifiable tracking is good for nondisabled students, why doesn't a system exist that will accommodate the requirements of students with disabilities?

This is a question for the state legislature; I assume that VARAC staff would have no objections to such a parallel system that overlaps TVAAS, and that they have a great incentive for Tennesse to put in place preventive measures against discriminatory effects. Regardless of our differences over educational evaluation, I assume we can agree on this.

=========================================================================

Date: Wed, 4 Jan 1995 23:10:05 -0600
From: Alan Davis

I find that I agree with all of Sherman Dorn's criticisms of the TVAAS evaluation system. All evaluation is political, because the process and findings of evaluation affect power and resources. The interests and assumptions of some are reflected in the questions the system are designed to answer and the test data selected to answer them; the interests and assumptions of others (teachers, in particular) are not. The fact that the system seems to rely on objective evidence of learning and is statistically sophisticated does not change the essentially political nature of the process; by cloaking the enterprise in the garb of science, teachers may be doubly intimidated and policy makers impressed, but nonetheless you have here a summative evaluation conducted far from the site of local decision making that is likely, in my view, to have pernicious effects despite the very sincere intentions of its designers to be fair.

I am open to the possibility that TVAAS may be a valuable tool for research. I believe that good teaching is teaching that results in significant learning for a broad spectrum of students. At the elementary level, I will go so far as to agree that the sort of standardized multiple choice measures designed by CTB will detect improvements in reading comprehension and mathematical conceptual and problem solving understandings that most of us want for our children. I have used regression residuals from hierarchical linear models of learning outcome measures regressed on pretests and SES to select individual teachers to study in my own research on effective teaching. The main problem, I believe, is that the multiple choice measures used in Tennesee are at the outset poor proxies for the tasks we really want kids to be able to perform in school, and when we attach consequences for teachers around socres on these measures, very bad consequences are likely to result.

At the same time, students' performance on the tests may have consequences for teachers. At the very least, someone is forming judgments about what teachers are good and what teachers are not so good based upon their students' performances on these tests. Linda McNeil, Lorrie Shepard, Mary Lee Smith, and others have documented the behavior of teachers under these circumstances. They concentrate on raising the scores rather than teaching to the broader goals of instruction, and the strategies they hit upon to raise scores involve adjusting instruction to model the format, content, and mindset of the multiple choice tests, sometimes to the point of teaching actual passages and items.

The problem, which I do not believe can be avoided, is that the validity of tests is inseparable from the uses to which they are put. A test that may be valid as a research tool to discover effective teaching will lose its validity when teachers suspect that the outcome has consequences for themselves.

=========================================================================

Date: Fri, 6 Jan 1995 19:22:04 CST
From: Rick Garlikov
Subject: TVAAS/Harvey Goldstein

Harvey Goldstein stated on Dec. 19, 1994, "..it turns out that these [standard errors] are typically so large that you cannot make any statistically significant comparisons between most of your schools...only those at opposite extremes of a ranking. Is this also the case in Tennessee? If so what do you do about it when reporting?"

From TVAAS:
Below are listed the mean gains for math with their standard errors for schools within one of the larger school systems in Tennessee. These means are three year averages and were calculated from the TVAAS mixed model process. This should give an idea of the sensitivity of the process.

            INTERMEDIATE SCHOOLS
 
        GRADE   SCHOOL    MEAN   STD. ERR
                RANK      GAIN
 
           3        1   71.596     4.931
           3        2   71.205     3.714
           3        3   67.038     2.624
           3        4   66.876     3.641
           3        5   62.734     3.427
           3        6   62.574     6.906
           3        7    62.14      5.17
           3        8   62.032     3.628
           3        9   61.713     3.534
           3       10   59.062      4.04
           3       11   58.096     3.337
           3       12   57.262     4.849
           3       13    57.22     2.321
           3       14   56.909     5.552
           3       15   55.062     3.866
           3       16   54.775     3.301
           3       17     53.8       2.4
           3       18   53.553     4.459
           3       19   53.452     2.813
           3       20    52.06     2.074
           3       21    51.24     1.411
           3       22   50.853     2.909
           3       23   49.298     3.716
           3       24   48.654     3.134
           3       25   48.269     4.603
           3       26   47.889     3.141
           3       27   47.884     3.714
           3       28   47.864     2.931
           3       29   47.574     5.398
           3       30   46.382      3.73
           3       31   46.281     1.664
           3       32   44.722     3.622
           3       33   43.843     2.533
           3       34   43.272     3.417
           3       35   42.965      2.47
           3       36   42.907     2.443
           3       37   41.844     2.803
           3       38   41.645     2.772
           3       39   41.018     1.859
           3       40   40.079     2.649
           3       41   39.492     2.958
           3       42   38.662     4.633
           3       43   37.784     2.807
           3       44    35.84     3.517
           3       45   33.597     4.541
           3       46    32.46     5.394
           3       47   31.586     3.868
           3       48   30.907     3.329
 
 
                 MIDDLE SCHOOLS
 
           7        1   25.894     0.971
           7        2   23.758     0.861
           7        3   23.134      1.17
           7        4   22.988     1.275
           7        5   22.738     1.347
           7        6    18.51     1.215
           7        7   18.185      0.99
           7        8   16.831     1.071
           7        9   16.268     1.075
           7       10   15.843     1.374
           7       11   15.394     1.055
           7       12   15.357     1.077

Additionally, there was a question concerning how TVAAS deals with missing scores. We will write a longer respose to this question later. But, briefly, it is more like the analysis of repeated measures. However, we do include all of the scores among subject-grade combinations. This is certainly sufficient and avoids the issues and problems associated with imputations.
Date: Sat, 7 Jan 1995 04:43:10 CST

Rick Garlikov:
Sherman Dorn, Alan Davis, and Mark Fetler all make important and cogent points about the nature of assessment and assessment tools. But I am arguing that characterizing these points by the term "political" is a mistake, and it is a mistake that severely weakens their message and the force behind that message.

There should be no doubt that assessment is not purely a matter of science, and that it is both a value-laden enterprise, and one subject to misleading scientific/mathematic measurements and the use of those measurements. But these issues are not "political" issues. ...

By using the word "political" to characterize the nature of the technical aspects of TVAAS, or any other scientific/mathematical/statistical sorts of assessment tools/indicators, you undermine your credibility with a government and public who believes, and I think rightly so, that surely there are some sorts of objectively recognizable contributions that teachers make or don't make to students' educations. To claim that all teacher evaluation, by its very nature, is "political" SEEMS to make the claim that there are no objective standards for discussing/judging teaching; and it makes it sound as though you are arguing that any judgment of teaching ability is worthless and unnecessary -- that no teacher is REALLY good or bad; they are all just popular or unpopular, so there is no fair way to judge any of them.

It seems to me that what you want to claim, especially if you go public, or lobby legislators with your points, is that there are many aspects, not only to good teaching, but even to the teaching of content, that are not reflected in the tests that TVAAS analyzes -- for the reasons that you give, and that the studies you cite show. And that though there is merit in the statistics that TVAAS produces (assuming that is true), those particular statistics are not what necessarily determine a teacher's value or contribution, any more than any one sport statistic determine's a player's value or contribution. People will understand and accept that (at least as a possibility). They will not accept your saying that scientific measurement or statistics is political any more than they would accept your saying that "batting average" in baseball is merely political. Neither of these things is "political" in any normal sense of the word. You are stretching the use of the word; and in doing so, you are losing most of your audience since most people will neither understand nor share your meaning. And they will think you are saying there can be no criteria of any sort for legitimately distinguishing good from bad teaching.

From: Rick Garlikov

The example Sherman gives of the firing/keeping of a teacher who teaches high-achievers well and low-achievers poorly as being a political decision seems to me to conflate a number of possibilities here. It is a political decision IF the wishes and needs of the aggrieved party are ignored simply because they are powerless to cause anything of consequence to the people making the decision. But it is not political if the teacher is replaced by a teacher who is able to help both groups of students. Nor is it political if a way is found to utilize the skills of the current teacher while also doing something to improve his/her skills with the other students, or to divide teaching assignments in other ways so that both sets of students get teachers who can help them the most. Not every decision is a political decision - -even if some of the results are the same as a purely political decision would be.

Now, since I don't understand what exactly Sherman is attributing to me with regard to the question of TVAAS's impeding the use of formative testing in teaching, I think there is some sort of misunderstanding. So let me try again.

Sherman had said that since TVAAS and the test it utilized did not give formative feedback in, say, students' long-division skills, it did not help teachers know whether their students were "getting" long division at a time that knowledge would be helpful to better teaching/learning. My response was meant to say that this did not matter --as long as TVAAS did nothing to actually impede good teaching --in this case, presumably, the use of formative testing by a teacher in order to figure out whether students were understanding or correctly doing long-division or not. The state does not need to train/foster/force/guide good teaching techniques; it just needs not to impede their usage. By analogy, serving good lunchroom food to teachers does not improve their skill at learning whether their students are improving at long-division either; but there are other reasons for serving good food. Similarly there are reasons for TVAAS apart from whether it serves to help teachers identify --as they are going along-- whether students are learning or not.

If the test that TVAAS uses measures skill in long-division, then presumably teachers will have a self-interest in teaching long division well, and in learning to use whatever methods will do that most reasonably. But the state, and certainly TVAAS, does not need to explain to teachers what those methods are. Especially if TVAAS is pointing out which teachers seem to be using good methods. An environment does not need to be helpful in ALL ways as long as it is not harmful in important ways.

Now, there will undoubtedly be teachers, just as there are students, who will try to find some shortcut, easy method to get good superficial results even if that means not doing what is best in the long run. But that seems to me to be a system problem only if the test or the statistical use of the test is so bad as to allow or reward, and therefore encourage, that. TVAAS is supposed to be able to pick this out even, not foster it. If Sherman has some reason to believe the test or the statistical package encourages shortcut, easy-way-out "teaching", then that is important to know. But his claim was that testing for ability in long-division at a time too late to correct inability, somehow precluded teachers from giving such formative testing in their everyday teaching. That just seems clearly false to me. The two things are at most unrelated, but I would suspect that generally something like TVAAS --if the test used is not vulnerable to teaching "toward" in a superficial way-- would encourage teachers to want to learn the best teaching methods, not the easiest bad ones. For Sherman's point to be successful, it seems there needs to be some evidence that TVAAS and the test it uses does, or likely will, adversely affect how teachers teach. The fact that some testing is high stakes testing, and the fact that some high stakes testing is counterproductive gives reason to be initially concerned; but it does not give evidence that TVAAS will actually be high stakes testing or that, if it is, it will impede rather than enhance good teaching.

Finally, it seems to me that the concern that TVAAS statistics will be misunderstood and misused by administrators, the legislature, the media, and the public, is best remedied by informing those groups in ways they understand and can use; not by trying to eliminate one kind of indicator from being used at all. Surely, though they resist it often, the media and the legislature can be made to understand some things -- and that one of them is that wise judgment needs to be based on all sorts of available evidence and not just on one statistic. As I argued in regard to my baseball analogy of a previous post, ordinary people do understand, when examples and reasons are given, that one statistic alone is not likely to indicate the overall value of a person's contribution to his/her profession. This is not a difficult concept to get even a legislator or a newspaper editor to understand, if it is explained in the right way. The legislature may even be persuaded to require certain kinds of clearly visible, audible, and understandable information about how to use TVAAS statistics whenever those statistics are reported in the media.

I think there are some important concerns about this enterprise that have not yet been raised, and I intend to raise them soon if no one else does; but I think they can be remedied too. I am not trying to defend TVAAS or the current use of it against all charges; but I think that, if the statistical methodology really is reliable and there are tests it can use which really do give some indication of what students collectively (or on average) have learned in a content area from a teacher, and if this information can be gathered relatively inexpensively and quickly, it is important to use the program and to simply be vigilant in keeping it from being misused.

From: Sherman Dorn

In Rick's response recently:

>For Sherman's point to be successful, it seems there needs
>to be some evidence that TVAAS and the test it uses does, or likely will,
>adversely affect how teachers teach.  The fact that some testing is
>high stakes testing, and the fact that some high stakes testing is
>counterproductive gives reason to be initially concerned; but it does not
>give evidence that TVAAS will actually be high stakes testing or that, if it
>is, it will impede rather than enhance good teaching.

This is an interesting counterpoint to the VARAC's argument that assessment should be scientific: don't we require "treatments" to be VERIFIED as both harmless and beneficial before we let them loose on the general public? There was no such experimentation with TVAAS (unless people consider it an ethical experiment to subject several hundred thousand children to the unknown effects of an untried evaluation system).

In reality, of course, there is little opportunity for such experimentation in policy, and it would misrepresent the purpose of policy (which is often to creat NEW systems -- and they're going to be new for someone at some point). But I am very concerned by the switch in burden of proof -- Rick is suggesting t despite evidence of some disturbing effects of high-stakes testing, we should le systems go ahead unless someone else can demonstrate concrete effects. This represents a rather high standard for opposing a new policy -- after all, we can demonstrate concrete effects until the policy is in place.

Also, let me answer Rick's questions about formative versus summative evaluation with an example: Suppose Renee Jones uses formative evaluation in her fourth-gr class for math and reading, and she responds in appropriate ways (changing instr when it seems appropriate, either for the entire class or parts of the class). the kids know everything fairly well, though three kids have only partially mast long division by the time of the annual tests. Later in the year, TVAAS gives h mediocre scores in math. How is she supposed to respond to this? Should she de that she should have concentrated on long division? Should she concentrate on o skills? Should she junk the formative evaluation system, since it obviously did help her? Remember, she has only three chances to improve her TVAAS scores in math before they can be used in personnel evaluation, and she has to make the decision for the ENTIRE year. And, the test for TVAAS doesn't give any extra information.

Now, I don't know whether TVAAS is consistent with other measures of school achievement -- thus far, the only study I am aware of that preceded the state's imposition of TVAAS was a mid-1980s test of the feasibility of the mixed- model methodology. I know of no empirical comparisons between TVAAS estimates of teacher effects and other measures of student gain. So let's choose another, more likely scenario:

Renee Jones does *not* use formative evaluation currently. For the past several years, she has tried various measures to "improve test scores," as her principal is constantly saying to the staff at her school. None have worked, a as Tennessee started to use the CTBS, her frustration has just increased with th lack of concrete feedback and the test scores that are at variance with her admittedly subjective judgment of the children's competence. Now comes TVAAS, and the stakes have taken a quantum leap upwards for her. The first set of TVAAS scores for her as a teacher this year looks lousy. She's very nervous, and doesn't really know what to do. She's considering several options: adopting a math program that's patterned on "Hooked on Phonics," with moderately expensive tapes but all planned out for the teacher; working with other teachers at her school in an "experiential math curriculum" (which they haven't yet desig spending most of her math time in the computer lab to try to teach using the CAI software she can spend her allotted class material funds for; or investing t time to learn a formative evaluation of math (which requires a lot of time and m test scoring and the tests). Again, of these FOUR options, she has to choose ONE for a single year. Also remember, she has a personal history of trying various measures to improve her test statistics and of the test scores being at variance with her own judgment of students' competence.

It is very likely, in my personal contacts with teachers and knowledge of resear and gut instinct, that Renee Jones will dismiss the formative evaluation system because (a) she has no reason to believe it will improve her test scores any mor than anything else she has either tried or is considering trying; and (b) it is resource-intensive, moreso than other possibilities. In this case (and this is assuming teachers are exposed to the idea of formative evaluation), the high-sta environment of TVAAS can drive a teacher away from a way of checking kids' performance that allows MORE flexibility for teachers. That is why I described TVAAS as in competition with formative evaluation, and why I believe states should instead be explicitly supportive of it.

__________

From: Alan Davis

Rick, Let me try to disentangle some of the arguments about the "political" nature of the TVAAS evaluation system.

My point that evaluation is a political act was not meant to be a criticism of TVAAS, or to argue that evaluation shouldn't occur. It was instead a recognition of the fact that when you decide to evaluate a publicly funded activity, such as teaching, the act of evaluation has consequences for the prestige, power, and resources of various actors. When one thinks about the information generated in the process of conducting an evaluation, one needs to distinguish the question, "Is this information accurate?" from the question, "Is this evaluation valid?" The latter question requires one to examine the interpretations that will be given to the data, the interests that will be served, and the consequences associated with providing limited data addressing a limited range of questions.

There are several problems with the validity of an evaluation system that compares teachers on the basis of the measured learning of students.

The system only credits learning measured on the test. It tells nothing about curiosity, work habits, or the social climate of the classroom.
The test is very limited. It correlates only moderately with the quality of students' written work, the ability of students to tell about what they have read, or the ability of students to conduct investigations, for example.
Students have little motivation to do well on the test. It is external to anything else they do in school.
Once teachers become aware that they are being compared on the basis of test scores, they are pressured to bring the scores up. To do this, they are likely to (a) stop spending time on things that are not on the test -- things like art, music, science projects, written stories and reports, and class discussions; (b) provide students more practice answering all sorts of things in the format of the test -- multiple choice; (c) teach the particular spelling words and vocabulary words that appear on the test in place of ones that are not; and (d) grow to hate the test and respond in a mode of resistance/compliance rather than in a pursuit of professional excellence.

Rick, as one who has in the past argued vehemently against the validity of multiple-choice assessment to evaluate student learning on this very list, I have a hard time understanding why you are intrigued by the use of the same tests to evaluate teachers. No one attempts to reduce the contribution of doctors or psychologists to a numerical score. Teaching is every bit as multi-faceted and context-specific, yet policy makers want to find a way to reduce teaching to a bottom line. I believe there is no way to do this that is actually beneficial to the education of children in the long run.

From: Rick Garlikov

Alan,
I agree with the concerns you list. And I have others along those lines, that I will express.

What I disagree with is calling what those concerns address to be the "political nature" of evaluation. (My mail got scrambled today, and I sent a response to Brian's post before getting yours; but I think it addresses this again. My point is not against your concerns; it is against calling the problems you discuss "political". I don't think that helps you make your case with anyone except those who happen to use "political" in this same, unusual way.)

TVAAS has made it quite clear that they are only providing one indicator of possible teaching problems or excellence. And they are apparently going to some lengths to keep tests from being predictable in certain ways. Sherman feels there is reason to believe that in spite of TVAAS intentions and explanations, forces are at work in the legislature and in ed administrative offices to misuse the results, considering them alone; and that teachers will be pressured into teaching in whatever ways they can "to the tests." Part of the reason for my wanting to discuss TVAAS was to see if there were not some ways of articulating what is going on, what should be going on, and what is likely to be going on, in ways that will be useful to the teachers and people of Tennessee to try to make sure there are pressures to have the "right" things going on regarding TVAAS and teacher evaluation in general. I AM concerned there is not enough of the right sort of information being given teachers, administrators, the media, and the public. But I think that can all be addressed and remedied.

I never thought teachers should be evaluated solely by means of a multiple choice test of their students. But I think it may be a good indicator of where to look for good or terrible teaching to confirm what TVAAS shows. When I was an academic adviser to freshmen and sophomores at Michigan, incoming freshmen were given Benno Fricke's OAIS test, which measured a number of things which were reported to advisers. I never saw one of those tests be wrong on a student, BUT I also never took any of those tests as a sure thing. They were merely to be used as an indicator of student academic interest areas and of students' motivation and emotional maturity levels. Low scores were simply red flags to be alert to certain kinds of possible problems developing. As an advisor I wanted to see other evidence and talk with the students to see whether the OAIS scores might be significant or not.

The same sort of thing should be done with TVAAS, and there is no reason that cannot be made clear to everyone. The OAIS scores were very helpful to good counselors and just because bad counselors might misuse them, that does not mean the test should not be given and scores reported.

The issues that need to be guarded against are those you bring up. But I think that can be done. And, if so, then TVAAS can be helpful in those cases where administrators in the past seem unable to detect or remedy problem areas on their own. It would be very difficult for a superintendent to continue to tell everyone it is not his system's fault they have poor GAIN scores, if other systems similar to his in all other respects have good GAIN scores. Further, one of the things I like about the TVAAS approach is that culturally advantaged districts with high achievement scores will not necessarily show the best GAIN scores --and in areas of Alabama I am familiar with, that would be a great thing, since many of the supposedly better systems rest on their students' backgrounds rather than the school districts' contributions to their education. They don't help students gain all that much; and that is important to point out.

Doctors and hospitals ARE partly assessed in certain kinds of numerical scores --mortality rates being one important one. But again, these are used as indicators, not sole factors.

I am not only opposed to grading students by multiple choice tests; I am opposed to grading them at all; and to grading them by means of any formal tests. I am opposed to grading teachers too. But TVAAS is NOT INTENDED to grade teachers or districts; it is intended to point to probable problems (and probable successes). I believe in assessment, not grades. And I believe that people have strengths and weaknesses which can be assessed in various ways. I try to assess my students' strengths and weaknesses so that I can help remedy their weaknesses, or so that I at least present the right sorts of material in the right ways --stuff they can assimilate and deal with in some feasible ways.

Education is one of the professions that has seemed rather weak in setting or maintaining high standards for itself. Many good teachers understand the incompetence of some of their colleagues who nonetheless receive tenure. Many parents, for good reasons, as well as those parents with bad reasons, think certain teachers are really incompetent. There seems to be a need for some sort of independent indication of who is doing a good job and who is not -- teachers and administrators.

Finally, one of the problems I have about testing students summatively is that it is not quite clear to me that a given test will be very reliable for a given student on a given day, for reasons I detailed a year ago in a long series of posts. But, what TVAAS is doing is weighing the results of hundreds of student tests involving a given teacher. It seems to me that averages will cancel out a number of individual factors, so that teachers will get some sort of average score, rather than an individual score. I don't want to give a kid a grade based primarily on a final exam; but I think I should be able to judge how well *I* have taught from the collective nature of 100 or 200 exams. When the median on the midterm in my second semester calculus course (when I was a freshmen) was 30, out of 84 possible points, the teachers of the 1500 students in the course knew something was wrong with how they had all taught that first part of the term. But it was not clear that any one student's test scores was definitely reflective of what s/he had learned or could do.

Anyway, what I am concerned about is ways of verifying TVAAS as a good indicator of good, bad, or average teaching; seeing what its limitations are in terms of what it can even identify probabilistically; and then trying to figure out how to state this so clearly and forcefully that legislators, media, and the public cannot misconstrue what the TVAAS results indicate, and how they need to be confirmed or disconfirmed. All this so that teachers do not feel threatened by the test itself, nor encouraged to try to take teaching shortcuts that merely shortchange kids. It may be that this cannot be done. But I don't accept that as the STARTING point in some a priori fashion.

=========================================================================

From: Rick Garlikov
Subject: Re: TVAAS and Dorn further

In response to Sherman's points. I was not as clear as I should have been about the "burden of proof" he discusses. There are things that TVAAS has done or tried to do to eliminate the kinds of problems high-stakes tests cause. Further, they have tried to be clear in their literature (though I think this could be much clearer yet) that TVAAS should not be high-stakes conclusive testing, but merely an indicator of problems. The question ought to be whether they have succeeded in keeping the tests used for evaluation from being high stakes tests or from being able to be predictably taught "to". It is not a question of burden of proof; it is a question of whether the whole program is relevantly similar to the kinds of high stakes tests, that do cause problems, documented in the literature.

As to "Renee Jones", it seems to me the rational thing for her to do would be to find out who has good test results and go talk with them about what they are doing that seems to work so well. It may be they do teach better than she, and that they can teach her how to do it. It may be that they use some trick she feels is unconscionable and short-changes kids. If so, TVAAS needs to be aware of this trick that skews their results. And they need to pay attention to the person reporting it.

But it may also be that Renee Jones just isn't a very good math teacher and isn't likely to ever become very good at it. Maybe she doesn't really understand math very well herself. What then? Do you want her to keep teaching math?

Finally, it boggles my mind that teachers --many teachers-- have no clue about what I guess is called "formative assessment" which to me just means doing whatever you can to try to monitor your students' knowledge and understanding as you go. I have gone over this ground before, but I just cannot see how one can be trained as a teacher, certified as a teacher, hired as a teacher, tenured as a teacher if one does not have at least some intuitive idea that one needs to monitor what kids are "getting" from one's instruction so that one can see whether modifications are necessary. Surely it does not take "resources" to teach this to ed students or for the state to expect it of teachers.

Of course there is some question whether what is tested is appropriate -- e.g., long division or whether long division ought to be part of the fourth grade curriculum. But TVAAS is set up to test whatever content area is specified by the curriculum; it is not their job to determine curriculum. I myself tend to suspect that the math curriculum is not too difficult for kids if it is taught correctly --which to me involves both practice AND deep understanding. But I am not certain primary grade teachers will have sufficient understanding themselves. I think finding the proper resources to teach math well will involve something other than finding resources for teaching about "formative assessment". It is about bringing people into classrooms who have enough math understanding to be able to teach math decently, even if they have the right teaching techniques and tools.
Rick Garlikov (dems042@uabdpo.dpo.uab.edu)

=========================================================================

From: Rick

Contrary to what Sherman says, teaching is done publicly. I count kids as being an audience that sees what is going on. And parents can observe, and also talk with teachers.

The important point he mentioned, however, is about ballplayers negotiating the stats that will be relevant to their contract bonuses, etc. That is only true within relevant limits -- generally the stats used have to be something that seem relevant to accepted productivity. Once in a while some screwy stat clause gets in and actually occurs, but generally the stat clause is about some obviously relevant achievement. And though teachers individually don't have the same luxury of contract negotiations baseball players do, they often have great collective clout both politically, and professionally --professionally where they help set standards that schools, teachers, and districts should be measured against. Although teachers sometimes fall victim to having irrelevant standards placed on their profession by outsiders, that is not always the case; and it doesn't have to be the case at all, I don't think.

From: "Fetler, Mark"

Rick, you said -

In reply-

Politics is not intrinsically good or bad. It is just the practice of resolving disputes or making decisions about the distribution of limited resources. I think what distinguishes evaluation from many other types of more basic or pure research is that evaluation is intended to serve policy makers. The decision to use evaluation is political, particularly if it affects policy decisions or the support for those decisions. Of course, there are always moral, ethnical, and relationship issues around the exercise of power. However, often one's perception of these issues is colored by the degree of benefit received. All this is true, in my opinion, even if the evaluation conforms to the most rigid technical standards and is "cleaner than a hound's tooth."

As to objective measures of teachers' contributions. I would submit that a really well designed, administered, scored, and reported achievement test is an objective measure -- of students' performance in answering certain questions. How that measure relates to teaching ability requires additional information, for example, about the students' abilities, opportunity to learn, etc.

From: Alan Davis

Rick,
First, I want to concur with Fetler and Stecher regarding the meaning of "political." You wrote that "In politics, people often use data to deceive" and went on to describe how politicians might intentionally mislead people through a selective interpretation or mis-interpretation of objective date. But as Mark and Brian pointed out, those of us who swim in the seas of policy analysis do not think of "political" as either good or bad. The term refers to processes affecting the ability to influence decisions -- processes having to do with power, in other words. Evaluation cannot escape being part of the the political process. Even the decision to evaluate has political consequences, even if no one ever reads the evaluation.

You long to introduce information into a politically neutral world, it seems to me, when you write, "Anyway, what I am concerned about is ways of verifying TVAAS as a good indicator of good, bad, or average teaching; seeing what its limitations are in terms of what it can even identify probabilistically, and then trying to figure out how to state this so clearly and forcefully that legislators, media, and the public cannot misconstrue what the TVAAS results indicate, and how they need to be confirmed or disconfirmed. Al this so that teachers do not feel threatened by the test itself, nor encouraged to try to take teaching shortcuts that merely shortchange kids."

It cannot be. As soon as I publish a list of teachers with their adjusted gain scores and my "clear and forceful" caveats about what these mean, some school board member will say, "These teachers at the bottom of this list should be put on probation" and a good many teachers will feel threatened by the test itself. I cannot conceive of any evaluative use of this information that would not put teachers under pressure.

From: Rick Garlikov

I have a few questions about TVAAS results, and the way scores are used and reported.

According to TVAAS results, "student gains were not related to the ability or achievement levels of the students when they entered the classroom." There are a number of possible causes for this, some good, some not so good. And I wonder whether TVAAS has any evidence which causes operate, and to what extent.

First of all, the evidence is somewhat counter-intuitive for individual students at least, if not for groups of students. It would seem that a "bright" kid with a good background in language, reading math, etc., curiosity, and motivation would learn a lot more than a kid with a disadvantaged background who is also perhaps a bit slower to pick up new concepts or understanding. Is the TVAAS result dependent in some way on averages for groups rather than direct comparisons of the "brightest high achievers" versus an average or "below average student"? That is, do a lot of bright high achievers lose motivation or slow down on gains for some other reason, in order to pull the "high-achiever" average down?

Or are you saying that with equal or really good teachers lower ability/achievement students can gain in some sense actually or proportionally as much as higher ability/achievement students? That a student who comes into third grade being facile with multiplication tables will not progress much further into math (e.g., division, long division, factoring, combining fractions through lowest common denominators, etc.) than a student who enters the third grade with no good grasp of either multiplication facts or an understanding of what multiplication is about?

Of course, one, I think bad, way this can happen is if high ability/achievement kids are ignored and simply not taught to their potentials, while teaching time, energies, and lessons are devoted primarily to lower ability/achievement students. This is the way many teachers teach, in fact, and it is the what some school districts seem to promote. Higher ability/achiever students are left to learn whatever they do on their own essentially, so their "gain" relative to their potential gain is low. This philosophy often stems from a reaction to an also bad, previous, philosophy where resources were channeled in helping the PERCEIVED best and brightest academically while letting other students slide more than would be helpful.

Which brings me to the question of whether the notion of "norms" does not serve as somewhat of a false standard for students of all levels of ability/achievement. Shouldn't the "standard" reflect, not some sort of mean, but some sort of reasonable potential, as gauged perhaps by what the best teachers, schools, or districts do, not what the average does. So that if one school or district consistently has significantly higher gain scores (along with other evidence that school is doing much for its students) with, say, the same relative budget) isn't that evidence such teaching is possible and that it ought to be the goal to aim for. Perhaps research evidence might demonstrate an even higher level to be reasonably possible under the typical school conditions. Shouldn't there be some sort of distinction made about how far from what is possible a school, district, or teacher does, not just how far from the average they do?

Of course, anyone can compute how far below the current best they are in regard to these particular gain scores, by subtracting; but I am not certain people bother to do that, or that if they do, they care about the result. If it were a reported number, it might be more of an incentive for schools to improve instead of trying to maintain or achieve merely the status quo average.

I raise this issue because I am concerned about teachers/parents/administrators having unreasonably low expectations. I am aware that there may be "curve setters" based on additional resources or circumstances not likely to be replicable in other systems. I am not after a standard that is out of reach, but the highest reasonably, or probably, attainable standards.

From: Gene Glass
Subject: TVAAS, Bright Kids, Gains, Fairness.

On Tue, 10 Jan 1995 14:28:52 CST Rick Garlikov said:

>I have a few questions about TVAAS results, and the way scores
>are used and reported.
>      According to TVAAS results, "student gains were not related
>to the ability or achievement levels of the students when they
>entered the classroom."  There are a number of possible causes
>for this, some good, some not so good.  And I wonder whether
>TVAAS has any evidence which causes operate, and to what extent.
>      First of all, the evidence is somewhat counter-intuitive for
>individual students at least, if not for groups of students.  It
>would seem that a "bright" kid with a good background in
>language, reading math, etc., curiosity, and motivation would
>learn a lot more than a kid with a disadvantaged background who
>is also perhaps a bit slower to pick up new concepts or
>understanding.

Indeed, Rick's Sense that something is not right here is quite understandable.

I originally asked TVAAS (whoever it is we are talking to) how it could be fair to compare the gains in achievement of students from one teacher to the next if the abilities (on average) of the students in the classes differed substantially. To this question, TVAAS long ago gave the incredible answer that it was an empirical fact that in their data there is no difference in the gains of bright children and slow children. Rick quoted the relevant sentences above.

This artefact is not because bright and slow children actually learn at the same rates; it is because the TVAAS system of calculating gains surely uses a form of least-squares estimation that forces the pre-year measures to be uncorrrelated with the post-year measures.

I can take a group of IQ equal 120 kids who gain 1.5 years in grade equivalent units in a school year and a group of 80 IQ kids who gain .75 grade equivalent units in a school year and calculate the residualized gain score for the entire group and as sure as God made little green apples, the "gain" score will be perfectly uncorrelated with the pretest score. This is no paradox, nor does it make much sense as a measure of true gain.

I continue to believe that TVAAS has no fair way of equating the teachers who are compared with respect to the ability of their pupils.

Rick, you are right. TVAAS did not give an answer that was responsive to the question, and thus succeeded for a time in finessing an important criticism of the system.

=========================================================================

From: Rick Garlikov

Alan Davis said:

>It cannot be.  As soon as I publish a list of teachers with their
>adjusted gain scores and my "clear and forceful" caveats about what these
>mean, some school board member will say, "These teachers at the bottom of
>this list should be put on probation" and a good many teachers will feel
>threatened by the test itself.  I cannot conceive of any evaluative use
>of this information that would not put teachers under pressure.

You are saying school board members will ignore the caveats and will not be challenged by anyone in doing so. I would think that if the caveats ARE clear and forceful, it is likely there would be a strong challenge if school boards were ignorant enough to act in the way you predict.

>On the other hand, I don't think any of us who are complaining about
>TVAAS would argue that teaching should not be evaluated.  How best to do
>that is an important discussion to have.  I am arguing that the
>evaluation should not involve students' gains on standardized tests, even
>though I believe that student learning is what teaching is mostly about.

That is an agument worth pursuing; and I hope we can pursue that. As I understand the reasons for that though, they are that in high stakes testing, teachers tend to teach "to" the tests, and in doing so take shortcuts, leave out other important elements of education, etc. But as I said in response to Sherman, since TVAAS and/or TCAP has taken steps (1) to discourage teaching to the test by changing the test each year, and (2) to prevent this from being high stakes testing, by making clear it is only one form of indicator or evidence about the quality of teaching, and that in content areas alone, it seems to me the question is whether they have succeeded or not, or could.
Why do you think they have not succeeded or that they cannot succeed?

From: Rick Garlikov

Mark,
TVAAS claims to be able to isolate learning from the sorts of factors you mention --ability, opportunity to learn, etc. The question is whether this claim will stand up. But it is not that TVAAS has ignored the sorts of factors you mention; and it is not that they were unaware of their potential influence. The only question in THIS regard is whether they have satisfactorily accounted for these factors or not.

From: Sherman Dorn

Rick Garlikov writes:

 
>    You can call looking for, and trying to compile and describe, relevant
>data a political act; but that just confuses a whole bunch of things under
>a designation that applies in only the most superficial way, if it can be
>said to apply at all.

Rick, am I incorrect in reading your argument as follows? "It is unfair to call statistics political when people try to gather them in good faith; things that a political are manipulated."

I think this is a narrow reading of politics, and it misses the point that often actions have political assumptions precisely WHEN people act in good faith. Good faith judgment, in my view, is irrelevant to whether there is a politics of statistics. Wiliam Alonso and Paul Starr edited a book, _The Politics of Numbers_, several years ago, which discussed the production of government statistics, their political assumptions, and implications. I remember some discussion of the political assumptions of some government statistics, but I don't recall either a conspiracy theory or a description of much crass political motivation.

From: Greg Camilli

I am reposting an earlier message regarding TVAAS. I hope these questions will fetch responses, though I suspect the TVAAS staff are pretty busy.

First, it was indicated earlier that the norm-referenced items used by TCAP are somehow related to the CTBS/4. In this regard, I'm wondering if items are sampled from the CTBS, or whether new items are being written at every assessment. The latter is suggested by the phrase "fresh, non-redundant, equivalent tests."

Second, a number of different metrics are available from CTB: is TVAAS using the IRT (developmental) score scale which was established with national norms? Is there a document I could read to get the specifics of how the tests are constructed and normed? If so, it would really save a lot of time.

Third, I'm wondering if my interpretation of the missing data procedure is correct. I understand that multiple imputation does not affect the values of "effect" coefficients. The models and methods used to obtain these bypass the estimation of individual scores. However, multiple imputation increases the standard errors of the estimated coefficients -- this simply refects the notion that with less than complete data, less is known about the parameter. Thus, multiple imputation is a post hoc adjustment to estimation. However, with multiple imputation one can always produce a "complete" set of data (where imputed values have replaced missing values) to expedite reporting or secondary analyses. (Please do not hesitate to correct errors in this formulation.)

Finally, I hope that Harvey Goldstein can find time to study and comment on the standard errors reported to us by Rick. Based on his previous post I expected to see larger values. BTW, I've been wondering what the standard errors mean. Usually, I have in mind that a sample is drawn from a population, and an effect (say gain score) is estimated from the sample data. The standard error then conveys how precise this estimate is (much like the "margin of error" that pollsters use). For TVAAS, what are the sample and population?

From: Rick Garlikov

Sherman Dorn said:

>Rick, am I incorrect in reading your argument as follows?  
>"It is unfair to call statistics political when people try 
>to gather them in good faith; things that
>are political are manipulated."

No, that is not what I mean. I am not talking about whether it is fair or unfair to call something political. I am merely trying to capture the standard sense of distinguishing political acts from non-political ones. Politics tends to involve promoting an ideology or vested interest, often through some sort of position of power. It does not have to involve deceit; and I am sorry I mentioned that aspect of it before. Politics is generally distinguished from (attempted) dispassionate, reasoned, unbiased judgment.

This distinction is embodied in such dichotomies as "political" vs "non- political" appointments; or in saying that the courts are supposed to be a non-political part of government. Or in distinguishing between political trials and other sorts of trials, even though the non-political trials may also have ramifications for many people's lives. It is supposedly a bad thing to "politicize" the process of appointing Supreme Court justices, and democrats and republicans both often agree someone is qualified to serve and is not a political nominee. Of course, many supposed non-political things are actually political; and many politically appointed judges tend to ignore their own political views --often disappointing those who appointed them to the bench. The difference has to do with whether one is pursuing a bias or whether one is open to evidence and reason in some at least relatively unbiased way.

The kind of things you, Mark, Brian, and Alan are discussing are value-laden, of course, but being value-laden is not necessarily the same as being political. A scientist trying to find an answer to some problem may have to choose among a number of research avenues; and s/he may make an educated guess of some sort about which might be the most beneficial to pursue. Ordinarily, merely in the confines of the lab, that kind of choice is not a political choice, even though it may involve expenditures of funds, directions of lots of people, etc. It is not the same kind of thing as choosing to pursue a line of research because that is where government funding is easy to secure or because it leads to promotion and tenure at the university, fame, or whatever. These latter I take to be political considerations about what research to pursue.

My point is simply that the average citizen makes a distinction between "political" decisions and "non-political" ones, and even though there may be some reason to argue his distinctions are sometimes fuzzy or erroneous, still there seems to be some sort of clear-cut cases; and if you say something is "political" in policy-analysts' technical sense of the term that is not political in the ordinary sense of the term, people will tend to dismiss what you are saying.

From: Leslie McLean

The purpose of this post is to focus on two aspects of the TVAAS that I feel have received too little attention: validity and standard errors. This is not to say that the political nature of any evaluation is not important or to take anything away from the discussion of formative vs. summative evaluation. My other point will be that we are getting mixed messages as to the purpose of TVAAS. Since I started to put this collection together, Greg Camilli has posted some related questions.

What follows is a selection from several postings concerning the TVAAS ((Alan Davies, Rick Garlikov, TVAAS, ...)), with comments from Les McLean added. Les's comments are enclosed in double parentheses: ((...)). Emphasis added by Les is signaled by *** ... ***

((Selections from Alan Davis's post of Wed. January 4. He amplified these on Jan. 9)) "The main problem, I believe, is that the multiple choice measures used in Tennesee are at the outset poor proxies for the tasks we really want kids to be able to perform in school, and when we attach consequences for teachers around scores on these measures, very bad consequences are likely to result.

Note first, that performance on the measures valued by TVAAS are not valued by students. The tests are not part of any instructional unit. No grade is associated with performance on them. Students have no intrinsic or extrinsic motivation to perform well on them, especially if they don't like mental puzzles for their own sake. *** The problem, which I do not believe can be avoided, is that the validity of tests is inseparable from the uses to which they are put. A test that may be valid as a research tool to discover effective teaching will lose its validity when teachers suspect that the outcome has consequences for themselves. ***" ((emphasis added--and stressed Jan. 9))

((Date: Wed, 4 Jan 1995 20:52:42 CST From: Rick Garlikov))

"TVAAS has already said in a previous post that their results are supposed to be used as INDICATORS that something is very good, or very wrong, about a school/district/teacher, not conclusive evidence. That leaves room for the kind of evidence to be given about the merit of schools/districts/teachers which IS reasonable to educators -- a concern that Sherman expressed in explaining (1). ((Emphasis in original))

.....From what has been said so far, TVAAS tries to ascertain how well certain subject content areas --as prescribed outside of TVAAS ***are being taught,*** not whether those subject content areas are important or whether "knowing" them in certain ways is important."

((Emphasis added--this is the clearest statement I could find that they equate quality of teaching with scores on standardized tests, scores that have been scaled and transformed in unknowable ways--see remark from TVAAS below.))

((Rick goes on to make it very clear,)) ".....I may be misunderstanding the point of TVAAS, but I thought it WAS established to give summative evaluations about teaching competence/progress, not formative evaluations about improving instruction. I take it that TVAAS is not about helping teachers/schools/districts do their jobs, but about suggesting, in some attempted objective way, to everyone how well or poorly they have done it."

((But then he seems to change his mind,)) ".....TVAAS is a statistical model for program evaluation. It is not a "technocratic tool," as Dorn so colorfully phrases it." ((When most people refer to program evaluation, they include content choice, teaching materials and teaching methods, and they distinguish this from 'teaching competence/progress'. Given the complexity of the 'model', Sherman Dorn ought to be allowed to refer to it as a 'technocratic tool'.))

....It was not developed by the State of Tennessee but by Dr. Bill Sanders, a statistician, for the purpose of addressing problems previously encountered in using student achievement data in educational assessment.

The State of Tennessee adopted it as the model for such assessment because TVAAS was able to supply valid, reliable, unbiased data based on student gains, data the State thought was important.

.....TVAAS merely provides measures of student academic gains, surely a useful component of any educational assessment system. .....TVAAS is a sophisticated statistical process ***that is beyond the capacity of many to grasp mathematically.*** However, TVAAS data is reported in a manner that is comprehensible to any educator willing to look at the reports and graphs supplied, with explanations, specifically for the purpose of rendering the data USEFUL. Judging by reports from many schools and systems across Tennessee, TVAAS data is now extensively used for curriculum planning and development.

((Yes, the mathematics is beyond most people's capacity--including most of the officials in Tennessee. How much beyond cannot be determined until they publish a full technical report--such as the one's ETS publishes about its surveys. The term, "mixed model" is quite inadequate to describe the scaling and multilevel model fitting that is going on. BUT--the data are being used for 'curriculum planning and development'????? THIS BRINGS ME TO MY MAIN POINT: what evidence is there that the "data" resulting from these complex statistical manipulations can be interpreted as INDICATORS of competent teaching--much less the quality of the curriculum? From the TVAAS, we learn of an even more ambitious outcome--they provide schools with gain scores FOR INDIVIDUAL STUDENTS. See what they say:))

......As we have repeatedly told Dorn, TVAAS does not lock the state into any testing program whatever. TVAAS can use any assessment data that provides appropriate linear measures. In other words, TVAAS requires scalable data because it assesses progress over time.

((Appropriate linear measures? ... scalable data? To anyone familiar with test theory since Lord and Novick (1968), these phrases suggest a very considerable locking in, even given Geoff Masters's extensions to IRT. The numbers the TVAAS are reporting back to schools are a long, long way from the item responses of the students to these multiple choice questions that Alan Davies quite rightly characterises as, " poor proxies for the tasks we really want kids to be able to perform in school". Even if we accept the items as valid indicators, we are entitled to ask for some careful checks that the scaled and imputed and estimated scores really do, "provide guidance as to 'What we do now'".

According to Sherman Dorn (8 Jan. Post), ))

"Now, I don't know whether TVAAS is consistent with other measures of school achievement -- thus far, the only study I am aware of that preceded the state's imposition of TVAAS was a mid-1980s test of the feasibility of the mixed-model methodology. I know of no empirical comparisons between TVAAS estimates of teacher effects and other measures of student gain."

((Concerning the size of the standard errors))

Below are listed the mean gains for math with their standard errors for schools within one of the larger school systems in Tennessee. These means are three year averages and were calculated from the TVAAS mixed model process. This should give an idea of the sensitivity of the process.

((A very few of the results are listed below for reference--the heading (Grade ...) has been edited to make sense. I echo Greg Camilli's wonderment at the size of these 'Mean std. Err.'s. Including all data from several years and using imputation procedures for the (inevitably) missing numbers all lead to larger standard errors.))

            INTERMEDIATE SCHOOLS
 
        GRADE     RANK   GAIN   MEAN STD. ERR.
 
           3        1   71.596     4.931
           3        2   71.205     3.714
           3        3   67.038     2.624
                ........................
 
 
                 MIDDLE SCHOOLS
 
           6        1   22.624     1.043
           6        2   20.176     0.824
           6        3   15.602      1.07
           6        4   13.943     1.099
           6        5   13.152     1.286
           6        6   12.521     1.051
           6        7   11.146     0.961
           6        8    9.897     1.194
           6        9    9.362      1.34
           6       10    8.465      0.998
                     ...................
 
           8       12    13.55      6.605
           8       13   13.261      1.738
           8       14    11.08      0.98
                 ......................

((The problem for those of us who have calculated, pondered and puzzled over such results as these, in national and international assessments, is that the reported standard errors are unbelievable (impossibly small). We can't say they are wrong, of course, because we lack the details of the calculations, but Harvey Goldstein has analyzed at least as much data and written several books and taken the lead in multilevel modeling (sometimes called, by others, hierarchical linear modeling), and his informed and experienced "opinion" is not to be taken lightly. The standard errors remind me of those Richard Wolfe found faulty in the first International Assessment of Educational Progress--the fault being that the estimates of error failed to include all the components reasonable people agree should be included. Moreover, the Std. Errors above are clearly proportionate to the mean scores, not a desirable outcome. The must be at least one error (three lines from the bottom of those displayed above). I, too, will leave to later a comment on the statement below from TVAAS, except to say that whatever it is they do is not "certainly sufficient":

"...Additionally, there was a question concerning how TVAAS deals with missing scores. We will write a longer respose to this question later. But, briefly, it is more like the analysis of repeated measures. However, we do include all of the scores among subject-grade combinations. This is certainly sufficient and avoids the issues and problems associated with imputations."

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

From: Leslie McLean

Your questions are much appreciated, Rick, as they bring out unclear points and (we hope) useful elaborations.

1. The problem with pairing student achievement with teacher competence is that the teacher, or even two teachers in succession, can only build on what the students bring to the teacher's class, and then they only add a small amount to the total experience of the students. In short, it is simply not fair to judge a teacher by what the students learn in a year unless you are able to bring in lots of contextual information, not all of it quantifiable. This argument has been made in many places and many ways, so I only espouse it. I conducted a province-wide (read: state-wide) survey of student achievement in mathematics and English in 1981 (mostly constructed-response items, thousands of them) in Ontario, and I produced distributions of classroom mean achievement, suggesting that as many as 20 percent of teachers were in need of some help. BUT my contract specified that the teachers not be "identifiable" (not just not identified--not identifiable), a sensible provision in view of the large-scale nature of the survey--a sample, but covering a larger area than Tennessee, and at least as diverse. We are not unaware of statistical developments here in the Great White North--just skeptical when the mathematics cannot be understood except by a half-dozen people who have not been in a classroom in more than 20 years.

2. What I am arguing is that whether the "math is sound" is not a matter of correct calculations. These multi-step scaling and estimation procedures involve important substantive decisions at every step--decisions often made by well-meaning programmers or statisticians unfamiliar with the context from which the item responses came. The programmers and statisticians cannot ask the educators because the math is so abstruse. In summary--the numbers the TVAAS report are so far removed from the test items that students answer that NO ONE can be very sure the numbers correspond to any reality.

It is this technical gap we worry ;about and ask to be bridged, and we see no bridge. Until there is a bridge, we cannot discuss how appropriate the process is.

=========================================================================

From: Leslie McLean
Subject: Les stands corrected

Apologies to Rick for attributing to him some text from TVAAS. I knew when I was putting it together that this was a danger, and I was not careful enough. Mea Culpa.

From: Rick Garlikov
Subject: Your TVAAS post

Les,
I was glad to see your post. Just wanted to make one small correction first. You accidentally confused one of my own posts with one that was passed on by me from TVAAS. I am in the odd position of having my name and return address on all the posts they send me to forward, so some of them are bound to be seen as MINE by accident. The part right after where you said "((But he seems to have changed his mind)) ....TVAAS is a statistical model for...." is from them, not me.

However, I am not quite sure about the point you want to make regarding the apparent inappropriateness of assessing teachers by their students' progress on certain kinds of multiple choice types of tests. Ignoring for the time what "program evaluation" tends to cover normally, isn't there some point in trying to assess how well students are in general learning the specific kinds of content "knowledge" the State has already said they want taught? I see TVAAS as supposedly trying to measure how well schools, teachers and districts have done in helping kids "gain" increased knowledge or skills over the course of a year. I cannot comment on the reliability or soundness or meaningfulness of the mathematics of all this, but it seems to me that if the math is sound, that the concerns Alan, you, and Sherman have is not with TVAAS as much as it is with the kinds of tests they are being asked to evaluate and report on. TVAAS seems to me to have been quite clear about the possible limitations of those tests in terms of "general education" of students and other aspects of schooling. Their's seems to be a quite narrow focus; and one that is reasonable. The problem is in making that either a prime matter of assessment of teachers, or if Alan and Sherman are right, ANY part of the assessment of teachers. I tend to think it is a reasonable PART of the assessment of teachers. But it surely should not be the only aspect of teacher evaluation; and TVAAS says that it should not. Do you disagree with their and my view of this? Do you think there is no place for standardized test results about certain parts of the curriculum in seeking INDICATIONS of teachers' competence in those areas?

From: Gene Glass
Subject: Re: TVAAS and norms -Reply

On Fri, 13 Jan 1995 07:55:49 -0600 Richard Swerdlin said:

>    Relatedly, an elementary principal in East St. Louis, Illinois
>once told me that he felt guilty over doctoring achievement test
>results (Stanford Achievement Test).  He was fearful of possible
>criticism, if results were below expectations.  Occasionally such a
>case does surface in various parts of the country.
>In conclusion, it is possible that the Texas news item is applicable
>elsewhere too.

More than merely possible...a predictable consequence of these "high stakes" systems. When a large district in Texas--some 10 years ago--instituted a scheme of paying principals bonuses (up to $15,000) for test score gains, a couple of princpals were discovered to have doctored their results. Large gains on the target ("high stakes") test were common while no gains on closely related tests were also common throughout the district. These are demeaning situations in which to place educators.

Would those who administer the TVAAS program be willing to stake their yearly raises on a "significant" score gain on National Assessment for the state of Tennessee? Or does this kind of "accountability" only apply to those under us?

The effects of these high stakes arrangements on teachers and curriculum have been much studied, and they are as one would expect: debasing the curriculum (as judged by teachers and educators), transformation of inquiry into "drill and kill," teaching the test and sometimes worse.

These are difficult matters to talk about because the effects of these programs place many teachers in circumstances in which they act unprofessionally. Then the politicians feel that their suspicions that teachers are unprofessional hacks are confirmed and they institute even further demeaning controls.

It is a sign of the political powerlessness of teachers that these "high stakes" testing systems exist. They are not applied to lawyers; they are not applied to physicians; they are not even applied to college professors.

And on this question of power plays. Does anyone else feel demeaned as I do by this arrangement that TVAAS has sucked us into for discussing these matters with them? We don't speak directly to them; our messages are carried to them as if to the great Wizard of OZ because they are too busy to be bothered by discussion--as if we who work here are mere slugs with nothing better to do with our time than nettle these great benefactors of society. There is greater combined statistical expertise among the participants in this forum than on the TVAAS staff, and those who are on the TVAAS staff have a professional responsibility to the discipline that extends beyond intermittent puffery and bluster.

From: Harvey Goldstein

I see that Les McLean and one or two others have taken up my query about how well schools can be separated taking the estimated standard errors into account. I don't yet know how the standard errors have been calculated, but based upon a table Sandra Horn sent me, I would say that the results (e.g. for grade 3 based upon a 3 year average in one of the larger school systems, are in line with our own results. What you do (roughly) is multiply each standard error by about 1.5, use this to place and interval (i.e. +-1.5 s.e.'s)about each gain estimate and judge whether two schools are significantly different at the 5% level by whether or not the intervals overlap. Most intervals do! BUT if you average over 3 years then you get smaller standard errors so fewer do. A particular problem with averaging over 3 years is that the data are even more out of date than using the latest 1 year data since they refer to a cohort who started x years earlier where x = years between intake measure and output measure + 2. Is such historical data of great use. One should at least be measuring trends over time. My own view is that value added procedures may have some use as a crude screen to detect highly 'overperforming' or 'underperforming' schools etc but are not diagnoses - such definitive judgements can only be made by studying what goes on in schools in more detail. A paper is shortly to appear in J. Royal Statist, Soc, A, which gives details of the interval setting procedure outlined above. Anyone who wants an advance copy please send me their FAX number.

From: Leslie McLean

This is an ASCII file, output from MINITAB, but it might arrive a mess or be so chopped up as to be unreadable. If you can read it, you see a simple analysis of the gains and standard errors Rick sent us from TVAAS. My observation that the gains were proportional to the standard errors does NOT seem to be true--within grades. If you lump all grades together, the correlation is over 0.5, but within grades (the correct plot, IMHO) it is essentially zero. Grade six shows a substantial NEGATIVE correlation, but there are only 12 observations.

What are these standard errors anyway? In a separate post to me, Greg Camilli points out that if all students are tested, then the "sampling error" has to be zero. What we need to make sense of this is, as I have said already, a technical report. How are they modelling the error in their multilevel models? What explanatory variables do they use? (Rick says that TVAAS says they allow for OTL ...) Do they include covariance terms? Is the "standard error" an estimate of measurement error? Just how much data is missing? etc... etc... etc...

 
 MTB > note     DATA ON GAINS AND STANDARD ERRORS BY GRADE FROM TVAAS
                A=Grade 3, B=Grade 4 ..., F=Grade 8
 
  stderr  -
          -          F                                A
          -
       6.0+
          -              B                        A
          -               C       A         A        A
          -             B    B        A     A     A         A
          -           B           A   B B       A
       4.0+          C     B  2 C2     B    B B  A A
          -              2      C   A CB 2AAAA       2A  A A
          -                 C B2 4C BBB     3B   A A
          -             CCCC C2CBBCCBA423   A A A        A
          -         C  CB  C 2 B2CC 3C  B3  B   A A
       2.0+                222  3C2B  BA       A
          -          F  F C CCCCC     B    A  A
          -   D  DDDD346EFF5  C  C 2
          -       D2  F E2  EE C
          -
            +---------+---------+---------+---------+---------gain
            0        15        30        45        60
 
 MTB > boxplot 'stderr'     (I believe at least one outlier is an
                             error--the one from Grade 8)
 
 
                      -------------
                ------I     +     I--------------      * *
                      -------------
           +---------+---------+---------+---------+---------stderr
         0.0       1.5       3.0       4.5       6.0
 
 MTB > info c11-c22   (The data, unstacked by grade--outliers in)
 
 Column   Name          Count
 C11      gain3            48
 C12      err3             48
 C13      gain4            48
 C14      err4             48
 C15      gain5            48
 C16      err5             48
 C17      gain6            12
 C18      err6             12
 C19      gain7            12
 C20      err7             12
 C21      gain8            14
 C22      err8             14
 
1
 MTB > plot 'err3' 'gain3'
 
          -
  err3    -
          -                                      *
          -
       6.0+
          -                                *
          -       *               *             *
          -              *        *        *              *
          -         *                   *
       4.0+       *                      *   *
          -           *      * ** **            2*   *   *
          -      *                **     *  *
          -             *** 2     *  * *             *
          -                  2*         *  *
       2.0+                *          *
          -                     *    *
          -
          -
            ----+---------+---------+---------+---------+-----gain3
               30        40        50        60        70
 
 MTB > corr 'err3' 'gain3'
 
 Correlation of err3 and gain3 = 0.188
 
 MTB > plot 'err6' 'gain6'
 
      1.40+
          -     *            *
  err6    -
          -                         *
          -
      1.20+                   *
          -               *
          -
          -                           *  *
          -                        *                   *
      1.00+                *
          -                     *
          -
          -
          -                                       *
      0.80+
          -
          -
            --------+---------+---------+---------+---------+-gain6
                  5.0      10.0      15.0      20.0      25.0
 
1
 MTB > corr 'err6' 'gain6'
 
 Correlation of err6 and gain6 = -0.598
 
 MTB > plot 'err8' 'gain8'
 
  err8    -
          -             *
          -
       6.0+
          -
          -
          -
          -
       4.0+
          -
          -
          -
          -
       2.0+
          -            *                *
          -              *  *        2*           *   *   *
          -   *                  *                *
          -
            --------+---------+---------+---------+---------+-gain8
                 12.5      15.0      17.5      20.0      22.5
 
 MTB > note   NOT MUCH POINT IN CALCULATING PEARSON R HERE, EH?
 
 MTB > copy 'err8' 'gain8' c51 c52;
 SUBC> omit 'err8'=6.0:8.0.           (get rid of the outlier)
 MTB > name c51 'err8trim'
 MTB > name c52 'gain8trm'
 MTB > plot c51 c52
 
      1.80+
          -            *
  err8trim-
          -
          -
      1.50+
          -                             *
          -
          -                                           *   *
          -
      1.20+                           *
          -                          *            *
          -              *  *        *
          -
          -   *                  *
      0.90+
          -                                       *
          -
            --------+---------+---------+---------+---------+-gain8trm
                 12.5      15.0      17.5      20.0      22.5
 
 MTB > corr c51 c52
 
 Correlation of err8trim and gain8trm = 0.039
 
 MTB > NOTE    STILL SEEMS TO BE AN OUTLIER (A DIFFERENT ONE)
 MTB > note       NOT PURSUED.

From: Rick Garlikov
Subject: TVAAS responses

From TVAAS:
All this is to say that I am looking at about a dozen posts regarding TVAAS tonight, alone. I don't know whether we can take on the philosophical discussions, simply because we don't have the time, but there are others among you who offer alternate viewpoints, eloquently. I honestly think we are only going to have the time to write responses about the model, what it does and how it does it, although I will provide all the answers I can about how TVAAS is being used by teachers, administrators, and others.

In closing, I would like to direct you to our article, "The Tennessee Value-Added Assessment System (TVAAS): Mixed-Model Methodology in Educational Assessment," in the _Journal of Personnel Evaluation in Education_ 8:299-311, 1994. It may answer some of your questions about the model.

Thanks for your interest in TVAAS and for your patience.

From: Sherman Dorn

The following are excerpts from the 1992 Tennessee state act creating TVAAS and the relevant sections ($$, if you'll excuse the ASCII notation):

$49-1-601.

(a) There shall be performance goals for each school district which shall include, but not be limited to, determinations based on the current status of each local school system as determined through the value added assessment provided for in $$ 49-1-603 -- 49-1-608.

(b) The goal is for all school districts to have mean gain for each measurable academic subject within each grade greater than or equal to the gain of the national norms.

(c) If school districts do not have mean rates of gain equal to or greater than the national norms based upon the TCAP tests (or tests which measure academic performance which are deemed appropriate), each school district is expected to make statistically significant progress toward that goal. The rate of progress within each grade and academic course, necessary to maintain compliance with $$ 49-1-209, $$49-1-210, and this part, will be established after two (2) years of consecutive testing with tests adopted for each grade and subject, as provided in $$49-1-603 -- 49-1-608. Schools or school districts which do not achieve the required rate of progress may be placed on probation as provided in $49-1-602. If national norms are not available then the levels of expected gain will be set upon the recommendation of the commissioner with the approval of the state board.

[$$49-1-209, -210 deal with administrative oversight by the state commissioner and board of education; it does not specify directly anything about TVAAS or specific items of accountability.]

(d)All schools within all school districts are expected to maintain appropriate levels of school attendance and dropout rates. The 1991-1992 school year is the base year for measuring levels of attendance and dropout rates. Schools which do not maintain appropriate levels, as set by the state board on the recommendation of the commissioner, may be placed on probation, as provided in $49-1-602.

(e) There is a rebuttable presumption that if a school or school district has not achieved the goals pursuant to subsection (c) or maintained attendance and dropout rates pursuant to subjection (d), it is out of compliance with the requirements of $$49-1-209, $$49-1-210, and this part and subject to probation as provided for in $49-1-602.

$49-1-602. ...

[lots of administrative detail about probation, then]

(c) ... If after two (2) consecutive years a system remains on probation, the commissioner is authorized to recommend to the state board that both the local board of education and the superintendent be removed from office. If the state board concurs with the recommendation, the commissioner shall order the removal of some or all of the board members and/or superintendent and shall declare a vacancy in the office or offices....

[procedures for filling the vacancies follow here]

$49-1-603.

(a) Value added assessment means:

(1) A statistical system for educational outcome assessment which uses measures of student learning to enable the estimation of teacher, school, and school district statistical distributions; and

(2) The statistical system will use available and appropriate data as input to account for differences in prior student attainment, such that the impact which the teacher, school and school district have on the educational progress of students may be estimated on a student attainment constant basis. The impact which a teacher, school, or school district has on the progress, or lack of progress, in educational advancement or learning of a student is referred to hereafter as the "effect" of the teacher, school, or school district on the educational progress of students.

(b) The statistical system shall have the capability of providing mixed model methodologies which provide for best linear unbiased prediction for the teacher, school and school district effects on the educational progress of students. It must have the capability of adequately providing these estimates for the traditional classroom (one (1) teacher teaching multiple subjects to the same group of students), as well as team taught groups of students or other teaching situations, as appropriate.

(c) The metrics chosen to measure student learning must be linear scales covering the total range of topics covered in the approved curriculum to minimze ceiling and floor effects. These metrics should have strong relationship to the core curriculum for the applicable grade level and subject.

$49-1-604.

[This section refers to several published articles, including one co-authored by Sanders in American Statistician, February 1991.]

$49-1-605.

[This section provides for annual estimates of school and district effects.]

$49-1-606.

(a) On or before July 1, 1995, and annually thereafter data from the TCAP tests, or their future replacements, will be used to provide an estimate of the statistical distribution of teacher effects on the educational progress of students within school districts for grades three (3) through eight (8). A specific teacher's effect on the educational progress of students may not be used as a part of formal personnel evaluation until data from three (3) complete academic years are obtained. Teacher effect data shall not be retained for use in evaluation for more than the most recent five (5) years. A student must have been present for more than one hundred fifty (150) days of classroom instruction per year or seventy-five (75) days of classroom instruction per semester before that student's record is attributable to a specific teacher. Records from any student who is eligible for special education services under federal law will not be used as part of the value added assessment.

(b) The estimates of specific teacher effects on the educational progress of students will not be a public record, and will be made available only to the specific teacher, the teacher's appropriate administrators as designated by the local board of education, and school board members.

[The following sections detail additional matters regarding test security and requiring fresh test items annually.]

________________________________________________

From: Sherman Dorn
Subject: Re: Laying down the law

>Please let there be a historian, investigative journalist, snooper or
>other seeker after truth who documents how this legislation came to be
>enacted in its present form!

I am an historian, but I have not had time to research this. (I moved to TN in late 1993, well after the relevant events.)

I gather (and the TVAAS folks can correct me) that Governor Ned McWherter decided to enact a wholesale reform of educational finances (including a plan to dramatically increase spending on education over several years) in 1992. The state senator who chaired the education committee (Albright is his name, I believe, though he was defeated in the primary last year) was very active in inserting the TVAAS language into the bill as an accountability mechanism. My assumption is that he, and other legislators, didn't want to spend a lot more money without some check on the "product." Very reasonable; I just disagree with how they went about it.

More detailed materials are at the state archives downtown in Nashville. Someday I may have a chance to consult the committee and whole-body legislative records.

From: Rick Garlikov
Subject: Re: TVAAS legislation

From TVAAS:

----------------------------Original message------------
We would like to thank Sherman Dorn for providing sections of the legislation regarding educational accountability in Tennessee. We would like to comment on a few sections. We are deleting the sections upon which we have no comment to save your time and space but refer you to Dorn's original post for the balance.

On Sat, 14 Jan 1995, Sherman Dorn wrote:

> The following are excerpts from the 1992 Tennessee state act creating
> TVAAS and the relevant sections ($$, if you'll excuse the ASCII
> notation):
>
>         $49-1-601.
>
>         (a) There shall be
> performance goals for each school district which shall include, but not
> be limited to, determinations based on the current status of each local
> school system as determined through the value added assessment
> provided for in $$ 49-1-603 -- 49-1-608.
>
Please note that the performance goals are not limited to TVAAS findings.
 
>         (b) The goal is for all school districts to have mean gain for
> each measurable academic subject within each grade greater than
> or equal to the gain of the national norms.
>
>         (c) If school districts do not have mean rates of gain equal
> to or greater than the national norms based upon the TCAP tests
> (or tests which measure academic performance which are deemed
> appropriate), each school district is expected to make statistically
> significant progress toward that goal.  The rate of progress within
> each grade and academic course, necessary to maintain compliance
> with $$ 49-1-209, $$49-1-210, and this part, will be established after
> two (2) years of consecutive testing with tests adopted for each
> grade and subject, as provided in $$49-1-603 -- 49-1-608.  Schools
> or school districts which do not achieve the required rate of progress
> may be placed on probation as provided in $49-1-602.  If national
> norms are not available then the levels of expected gain will be set
> upon the recommendation of the commissioner with the approval
> of the state board.

Please note that, in this and in all other sections of the law pertaining to placing schools and systems on probation, the language is permissive, not directive. Schools and systems MAY be placed on probation; superintendents and school board members MAY be removed. There is no language in the law that states that these actions MUST take place.


> [$$49-1-209, -210 deal with administrative oversight
by the > state commissioner and board of education; it does not specify
> directly anything about TVAAS or specific items of accountability.]
>
>         (d)All schools within all school districts are expected to
> maintain appropriate levels of school attendance and dropout rates.
> The 1991-1992 school year is the base year for measuring
> levels of attendance and dropout rates.  Schools which do not
> maintain appropriate levels, as set by the state board on the
> recommendation of the commissioner, may be placed on probation,
> as provided in $49-1-602.
>
>         (e) There is a rebuttable presumption that if a school or school
> district has not achieved the goals pursuant to subsection (c) or
> maintained attendance and dropout rates pursuant to subjection
> (d), it is out of compliance with the requirements of $$49-1-209,
> $$49-1-210, and this part and subject to probation as
> provided for in $49-1-602.
>
>         $49-1-602.  ...
>
> [lots of administrative detail about probation, then]
>
>         (c) ... If after two (2) consecutive years a system remains
> on probation, the commissioner is authorized to recommend to
> the state board that both the local board of education and the
> superintendent be removed from office.  If the state
> board concurs with the recommendation, the commissioner shall
> order the removal of some or all of the board members and/or
> superintendent and shall declare a vacancy in the office or offices....

Please note all of the safe-guards in this section. Both the Commissioner of Education and the State Boards of Education would have to agree to this very drastic action. It cannot happen "automatically."

==================================================================

From: Harvey Goldstein

Re Les Mclean's message of 13th about standard errors. He quotes Camilli as stating that the standard errors given are 'sampling errors' and that if all students are tested then these are zero. I am confused! The usual standard errors quoted in this context are those relating to the accuracy of the estimated school effects where there is a conceptually infinite population of students of whom those measured (whether they are all those in the school at a particular time or not) are a random sample. If they are not this then what are they?

From: Gene Glass

On Mon, 16 Jan 1995 15:10:34 -0500 Greg Camilli said:
Harvey Goldstein raised questions about what Greg Camilli said about "standard errors" thusly:

>>Re Les Mclean's message of 13th about standard errors.
>>He quotes Camilli as stating that the standard errors
>>given are 'sampling errors' and that if all students are
>>tested then these are zero. I am confused! The usual
>>standard errors quoted in this context are those
>>relating to the accuracy of the estimated school effects
>>where there is a conceptually infinite population of
>>students of whom those measured (whether they are all
>>those in the school at a particular time or not) are a
>>random sample. If they are not this then what are they?

Greg answered with a hypothetical conversation between an educator and a statistician. I think Greg exposed some key problems with this notion of standard errors, and it is no more a problem with TVAAS than it is a problem with most applications of inferential statistics in education.

Harvey asks, in effect, what is wrong with regarding standard errors as being measures of the accuracy of samples as representations of "conceptually infinite populations" from which the samples might "conceivably have been drawn at random."

After more than thirty years of calculating, deriving, explaining and publishing "standard errors" and their ilk, I have come to the conclusion that I don't know what they mean and I doubt seriously that they mean anything like what they are protrayed as meaning.

Consider this: if the population to which inference is made is one that is conceptually like the sample, then the population is just the sample writ large and the "standard error" is much larger than it ought to be. If you show me 25 adolescent largely Anglo-Saxon boys who love sports and ask me the population from which they could conceivably have been sampled, I'll conceive of an "infinite" population of such boys. If no population has actually been sampled and all I know about the situation before me is the sample, then I will conceive of a population like the sample. This is surely the very opposite of inference and standard errors are surely beside the point.

Consider something even more troubling: I present you with a sample-- Florida, Alabama, Tennessee, South Carolina. N=4. I calcualte the state high school graduation rates, average them and calculate a standard error. What is the population? States in the Southern U.S.? Fine; that's certainly conceivable, even if not "infinite." But suppose that someone else conceives of "States in the U.S." Well, that's conceivable too. But it is surely ridiculous to think that these four states can be used to infer to both of these conceivable populations with equal accuracy (standard errors). Or to make matters worse, suppose that I suddenly produce a fifth "state": Alberta. Now it raises the question whether the conceptual population is "geo-political units in North America"-- or the entire Western Hemisphere.

I can't imagine that there is much wisdom in attaching a number accurate to two decimal places when we can't even be certain whether it is referring to an "inference" to the Southern U.S., North America or the Western Hemisphere.

Now, if you think I am playing with your head and will suggest a way out of this dilemma that rescues the business of statistical inference for us, let me assure you that I have no solution. In spite of the fact that I have written stat texts and made money off of this stuff for some 25 years, I can't see any salvation for 90% of what we do in inferential stats. If there is no ACTUAL probabilistic sampling (or randomization) of units from a defined population, then I can't see that standard errors (or t-tests or F-tests or any of the rest) make any sense.

Does any of this apply to TVAAS? Just this. If one is worried about "stability" (in any of the many senses in which the word could be interpreted) then why not simply compare teachers' scores across all years for which data are available. That would answer in very straightforward ways whether the ranking of teachers jumps around wildly for whatever reasons or is relatively steady.

(I hasten to add that I don't approve of such things as ranking teachers with respect to their students' test scores.)

From: Harvey Goldstein

Well, I enjoyed Greg Camilli's imaginary conversation, but of course the reality is that standard errors are not things statisticians invented to make life difficult. Most non-statisticians have little difficulty in understanding that if you only have a measurement on 1 student there ain't much to be said about the rest. The bigger the sample the more confident you become that what you have observed is a good guide to what you would get on repeated samples with also suitably large numbers...assuming of course that you adopted a sensible randomly based sampling strategy.
Now we come to the philosophical bit. Social statisticians are pretty much forced to adopt the notion of a 'superpopulation' when attempting to generalise the results of an analysis. If you want to be strict about things then the relationship you discovered between parental education and student achievement back in 1992 from a sample of 50 elementary schools in Florida can only give you information about the physically real population of Florida schools in 1992. Usually we are not interested just in such history, but in rather more general statements that pertain to schools now and in the future...we may be wrong of course and that is why we strive to replicate over time and place etc. BUT the point is that, getting back to value added estimates for a school, if we want to make a general statement about an institution we do have to make some kind of superpopulation assumption....what we happen to observe for the students we have studied is a reflection of what the school has done, and would have done, for a bunch of students, given their measured characteristics such as initial achievement. The more students we measure the more accurate we can be and that's why we need an estimate of uncertainty (standard error).

From: Harvey Goldstein

Gene Glass also takes me to task on standard errors and raises the interesting question of when a sample should be considered as having a reference population and when not. There is no general answer...it depends on what you want to do. As I said in my response to Greg, I cannot easily see how you can have empirical social science without assuming that the units (people, schools etc) you happen to have measured are representative (in the usual statistical sense) of a (yes) hypothetical population whose members exhibit relationships you want to estimate. Such populations must (I think) be hypothetical because they have to embrace the present and future as well as the past when the data were collected. The issue is therefore the general philosophical issue and not a statistical one - statisticians simply try to provide tools for making inferences about such populations.

From: Greg Camilli

Harvey reponded to a previous post with the following:

>Well, I enjoyed Greg Camilli's imaginary conversation, but of
>course the reality is that standard errors are not things
>statisticians invented to make life difficult. Most
>non-statisticians have little difficulty in understanding that
>if you only have a measurement on 1 student there ain't much
>to be said about the rest. The bigger the sample the more
>confident you become that what you have observed is a good
>guide to what you would get on repeated samples with also
>suitably large numbers...assuming of course that you adopted a
>sensible randomly based sampling strategy.

Smaller is better, I agree. Another issue is whether it is the correct standard error, and still another is whether the SE has a meaningful referent. If the sample consists of all kids in the system, how can imagining a larger group possibly create more information. If I want to understand the behavior of my three car (I wish), how would it benefit me to imagine I had a fourth? IMO, this is not a statistical issue at all. Population has always been a heuristic device.

>Now we come to the philosophical bit. Social statisticians are
>pretty much forced to adopt the notion of a 'superpopulation'
>when attempting to generalise the results of an analysis. If
>you want to be strict about things then the relationship you
>discovered between parental education and student achievement
>back in 1992 from a sample of 50 elementary schools in Florida
>can only give you information about the physically real
>population of Florida schools in 1992. Usually we are not

Your option is to generalize to physically unreal schools.

>interested just in such history, but in rather more general
>statements that pertain to schools now and in the future...we
>may be wrong of course and that is why we strive to replicate
>over time and place etc. BUT the point is that, getting back
>to value added estimates for a school, if we want to make a
>general statement about an institution we do have to make some
>kind of superpopulation assumption....what we happen to
>observe for the students we have studied is a reflection of
>what the school has done, and would have done, for a bunch of
>students, given their measured characteristics such as initial
>achievement. The more students we measure the more accurate we
>can be and that's why we need an estimate of uncertainty
>(standard error).

I think you said in your ensuing post that this really isn't a statistical issue. Generalizing beyond known populations is risky business, and requires more than statistical knowledge. This was the focus of the long and interesting dialogue between Cronbach and Campbell. Standard errors have something to do with the precision of estimates. Perhaps they convey something about how well a model fits certain data. You might want to argue, on this basis, that the model is likey (or not) to generalize; but model fit at one instant does not *logically imply* model fit one second later.

The standard errors will apparently be used to measure whether statistically significant progress is being made by schools that fail to meet the standard (whatever that turns out to be), so it is important to be clear about what SEs mean. I find it fascinating that they are being used as policy tools with legal implications. In this regard, it is important to understand what drives the SEs. I'm guessing that missing data will add to SEs (it really would be helpful if the TVAAS staff would respond), and am sure that unit size will decrease SEs. Thus, standard errors for schools will typically be smaller for districts than for schools than for teachers than for students. As far as I can tell, only certain districts are required to make statistically significant progress; this may turn out to be a pretty easy criterion to satisfy.

From: Leslie McLean
Subject: The law demands std. errors

The discussion of standard errors has gotten so involved that a look at the Tennessee legislation should tell us where standard errors are needed and what interpretations reasonable people ought to be able to put on them. Below, the text is from Sherman Dorn's post and Les McLean's comments are in upper case letters.

(b) The goal is for all school districts to have mean gain for each measurable academic subject within each grade greater than or equal to the gain of the national norms.

HOW WILL ANYONE DECIDE WHETHER THE MEAN GAIN IS GREATER THAN OR EQUAL TO THE GAIN OF THE NATIONAL NORMS? PUBLICATION OF "STANDARD ERRORS" MUST MEAN THAT AN ERROR BOUND WILL BE ESTABLISHED AROUND THE NATIONAL NORMS--PERHAPS 1.5 TIMES THE MEDIAN STD. ERROR PER GRADE--ONE "HARVEY", OR 2.0 STD. ERRORS--ONE "DORN".

OK, GANG, THE VEIL IS LIFTED FROM OUR EYES--THERE IS NO SUCH THING AS "STATISTICALLY SIGNIFICANT PROGRESS" WITHOUT STANDARD ERRORS AND THE ASSUMPTION OF SAMPLES FROM SOME POPULATION.

Schools or school districts which do not achieve the required rate of progress may be placed on probation as provided in $49-1-602. If national norms are not available then the levels of expected gain will be setupon the recommendation of the commissioner with the approval of the state board.

YO, COMMISH! I DO NOT ENVY YOU YOUR TASK.

(a) Value added assessment means:

(1) A statistical system for educational outcome assessment which uses measures of student learning to enable the estimation of teacher, school, and school district statistical distributions; and

I COULD WRITE A RATIONALE FOR A "STATISTICAL SYSTEM" THAT DID NOT NEED STANDARD ERRORS, GIVEN THAT THEY TEST ALL THE STUDENTS. IT WOULD CONTAIN CAREFUL, MODERN DESCRIPTIVE STATISTICS THAT WOULD GLADDEN JOHN TUKEY'S HEART.

(a) On or before July 1, 1995, and annually thereafter data from the TCAP tests, or their future replacements, will be used (NOTICE THE 'WILL'-- THE LANGUAGE IS NOT JUST PERMISSIVE HERE) to provide an estimate of the statistical distribution of teacher effects on the educational progress of students within school districts for grades three (3) through eight (8).

HERE WE ARE AGAIN--THESE GAINS ARE TO BE INTERPRETED AS "TEACHER EFFECTS". PEACE, TVAAS, BUT I DO NOT BELIEVE THAT ANYONE'S MODELS AND TECHNIQUES ARE YET GOOD ENOUGH TO ISOLATE THE TEACHER EFFECT FROM ALL THE OTHER EFFECTS ON STANDARDIZED TEST SCORES IN SCHOOLS WITH ALL THEIR COMPLEXITY. NEXT TO THIS CONCERN--IT IS A CONCERN ABOUT VALIDITY AND IS NOT VAGUE OR COMPLEX--THE DEFINITION AND ESTIMATION OF STANDARD ERRORS IS TOO SMALL A MATTER TO TAKE OUR TIME.

From: Gene Glass

I quite agree with Les McLean that set over against questions of the validity of TVAAS "teacher effects" the matter of "standard errors" is of lesser importance.

Here's a validity question. I have asked it twice here--so that TVAAS could address it, but have not seen an answer.

A passage from the law makes it highly likely that what TVAAS does is measure pre-year achievement and partial (via some form of least-squares estimation) it out of post-year achievement for each teacher's class. Although this makes the resulting "gains" uncorrelated with pre-year achievement, it does not make them uncorrelated with pre-year ability-- abiltiy and achievement not being the same thing.

Consequently, teachers working with classes of different average ability can not be said to be evaluated fairly.

(This short discussion doesn't even raise very serious matters of "errors of measurement" in any pre measure that will also lead to a breakdown in "equating" the teachers at the start of the year. Nor does it raise questions of how a third grade teacher can teach skills or understandings that won't come to fruition on tests until the fourth and fifth grade. Does the TVAAS model really detect this and allocate proper credit to the source of the knowledge? And so on through similar concerns about validity.)

From: Sherman Dorn

The TVAAS staff noted, accurately, that the 1992 legislation is inclusive rather than prescriptive in how it describes the role of TVAAS in evaluation. Nothing forbids the state from using additional statistics as part of evaluation, either on the system level or for personnel evaluation. Nonetheless, I think it is clear from the text of the law that TVAAS was always intended to serve a central (and, from my reading, THE central) role in program evaluation, and that the threat of probation and public censure from TVAAS is the major stick in the accountability system in Tennessee. Consider these aspects of the legislation:

        (1) Only value added assessment is rigorously defined in the
        legislation, through citations to peer-reviewed
        literature and definitions of threshold levels for students'
        inclusion in teacher effects.  (By contrast, the law talks about
        attendance and dropout rates without ever defining them, or
        how to apportion responsibility for students.)
 
        (2) The only defined standard in the law (that of meeting
        national norm gains where available) is directly related to
        TVAAS.
 
        (3) Only three topics of measurement (TVAAS, attendance,
        and dropping out) were mentioned in the legislation.

Now, it may be that school systems, and the Board of Education and new Commissioner of Education, will create all sorts of additional statistics to be part of program and personnel evaluation. But from where I sit on this drizzly night, I think that's unlikely -- and, moreover, the legislation as it stands currently sends a very powerful signal, one that reinforces the high-stakes nature of the current set of annual tests in Tennessee.

From: Harvey Goldstein

Les McLean's comments have inspired some more thoughts.
In the simplest value added model, an outcome score is regressed on an input score so that generally each school will have a different regression line - perhaps with varying slopes but in the basic model with parallel slopes so that schools can then be ranked on the resulting regression intercepts. (The actual analysis is a bit more complex but this simple model captures the essence). We find, typically, that the variation among these intercepts is relatively small compared to the residual variation of student scores about the regression lines for each school (5% - 30% depending on which educational system you are studying). In addition, the regression itself will account for quite a lot of the variation in outcome...maybe as much as 50-60%. This means that there is a substantial remaining variation (among students) unnacounted for and it is this residual variation which determines the standard error values. Thus, e.g. if this residual variation was zero, we would exactly predict each schools (relative) mean and the standard error of that prediction would be zero. This would mean also that once we knew each student's input score (and anything else we were able to put into our regression model) and the school that student was in we would have a perfect prediction of the student's outcome. Of course, we are nowhere near that situation and it is this uncertainty about the individual prediction that translates into uncertainty about the school mean (think of the mean roughly as the average of the student residuals about the regression line for each school). If you took another bunch of students with exactly the same set of intake scores you would NOT therefore expect to get the same set of outcome scores - this is what the uncertainty implies - nor the same mean for the school. In the absence of being able to predict with certainty we have to postulate some underlying value for each school's mean (otherwise we are pretty well lost) which we can think of as the limit of a series of conceptual allocations of students to the school. Thus an estimate of uncertainty, conventionally supplied by calculating the appropriate standard error,is important if you want to make any inference about whether the underlying means are different and, more importantly, to set limts (confidence intervals e.g.) around the estimated difference for any two schools or around the difference between a school's estimate and some national norm. Hence my original remark some time ago that when you did just that you found that most institutions could not statistically be separated, and I suspect also for TVAA that very many cannot statistically be separated from a National norm, whether they are actually above or below it.
It would be good to hear from the TVAA people on this issue.

From: Rick Garlikov

Since Sherman and others read the Tennessee law he quoted the way they do, it IS extremely important for the state of Tennessee to make clear that is NOT the way it should be read.

That being said, it is either through different experiences or just a mere quirk of psychology, but *I* don't get the same meaning out of what Sherman quoted as he does, and as others seem to. Even TVAAS, in their commentary on the law, did not point out what I think is the single most important word -- "rebuttable". I don't have the language available at this time, but the phrase was something like "It is the rebuttable presumption that a school district should have [average gain scores...]..." and that if they don't, AND don't seem to be able to AND if there doesn't seem to be any compelling reason (i.e. rebuttal) why they don't, THEN ....

Further, the consequence then is that the state superintendent (or whoever -- I forget) MAY recommend to the state school board that people be removed from offices, and the State School Board has to concur with that recommendation.

If Tennessee is anything like Alabama in these sorts of matters, it would take an act of God for all these things to happen in a way that would actually end up in somebody's removal -- regardless of how abysmal their district's test scores are. There would be a bunch of excuses given that would count as "rebuttal" whether they were or not.

So, not only do I not see the language's saying what Sherman sees, but I don't see the POLITICAL reality's being what Sherman sees, though I understand his expectations sometimes are met -- and though I see that WITHOUT explicit and REPEATED policy statements to the contrary there can be various sorts of pressures to have tests drive the curriculum and instruction in undesirable ways. Where Sherman and I disagree is whether tests will drive the curriculum even if there is a clear policy STATEMENT to the contrary.

From: Leslie McLean

First: Please dissociate names of persons from std. errors (Harveys and the like). When I think of it, I wouldn't like to be named for an error either. Consider this one: -1 = One Les.

Second: Harvey Goldstein's exposition on standard errors (17 Jan, "Standard Errors: yet again") may have been more than some wanted, but I found it instructive and thought-provoking. If you deleted without reading, reconsider--it gets at the heart of the matter of TVAAS.
While still wanting to retain the concept of the sample from some (unspecified) population, Harvey's main lesson for us was to highlight the crucial role of the model adopted by the statistician in estimating scores--gain scores, in the case of TVAAS. A model is a formula that the statistician considers a reasonable try at relating the desired quantity, the 'gain' in achievement (not directly measurable because of nuisances such as social class and prior learning) to aspects of schooling, such as teacher competence.

Advised by statisticians with wide experience outside of education (and maybe in education--we have not been told), the policy-makers decide to give the statisticians their head and to accept their estimate of 'gain', knowing that the formula will be complex and the procedures well beyond the understanding of all but a very few. The statisticians make a persuasive case that their formulae and their procedures will provide the policy-makers with an estimate of gain that will distinguish the bad teachers from the poor from the average from the good from the excellent. "National norms" are invoked, unspecified, but responsibility given to the Commissioner of Education to provide norms if the national government lets the side down.
All this tedious repetition is needed to give a context for Harvey Goldstein's description of standard errors. In essence (correction, Harvey, please, if needed) the errors are S&E, not SE--errors of Specification & Estimation, not of sampling. A 'specification' error is made when our model, our formula, does not accurately link the target (the gain) with the data (the item responses or scale scores plus proxies for prior learning and social class and the like). We ALWAYS made a specification error--the only question is how large. If we limit ourselves, as in the TVAAS, to linear models, and we try to estimate gains across big, complex societies such as states, the error can be huge--and there is not consensus how to estimate the size of the error. Here is a source of error.
Even though they do not sample students and schools, sampling cannot be avoided--people are absent, times of testing vary, the tests cannot possibly cover all the content (hence content sampling), items are omitted, test booklets get lost, some teachers do not cover the material on the test, ..., and so on and so on and so on. This is why we do not use a very simple formula:
Gain = (Avg. score end - Avg. score beginning)
After all, when we test everyone, and when the goal is to measure gains by these students this year in these places with these teachers, who needs an error term? With well-constructed tests, the measurement errors will cancel out when we calculate school and class means. Oh--there is measurement error in individual pupil scores, but we can report that (from the test publisher's manual) and besides, these scores don't count in the student's grade--the teacher does not get them in time, and even if they do they do not use them.

Ok, so I seem to have lost the tenuous thread of the argument--NOT SO! We have learned over the years that the simple formula is more likely to mislead than to lead--to distort our view of gain rather than to clarify it. Raw score comparison tables (called 'League Tables' in the UK, after the rankings of sports teams), however compelling they seem, are statistically invalid, immoral, racist, sexist and stupid. Apart from those few flaws, they are fine. But would Tennessee put up with such poor procedures? Not on your life--scaling, imputation, hierarchical linear models and prayer are brought into play. Here is another source of error.

Pick up the tread again, the two of you who are still reading. All this talk of standard errors and models and politics keeps coming back to one key aspect: VALIDITY. Do those numbers represent gains in achievement? The formulas and procedures are complex enough that evidence is needed. Even if they do, how accurate are they--and I mean how much do they tell us about better learning, class-by-class, teacher-by-teacher; or has the TVAAS traded in science for voodoo? Without a better explanation, the use of these scores to label teachers as competent or incompetent seems a lot like sticking pins in dolls.

It is possible to validate the numbers--but it would take a lot of thinking, a lot of hard work and maybe 0.01 of the budget of TVAAS.

From: Greg Camilli

Harvey Goldstein wrote:

>In the absence of being able to predict with certainty
>we have to postulate some underlying value for each
>school's mean (otherwise we are pretty well lost) which
>we can think of as the limit of a series of conceptual
>allocations of students to the school.

I think we're lost when we accept statistical inferences based on data that weren't observed, and moreover, do not exist conceptually. If "all the students in the school" doesn't really have that meaning, then we are playing a game with language.

>Thus an estimate
>of uncertainty, conventionally supplied by calculating
>the appropriate standard error,is important if you want
>to make any inference about whether the underlying means
>are different and, more importantly, to set limts
>(confidence intervals e.g.) around the estimated
>difference for any two schools or around the difference
>between a school's estimate and some national norm.
>Hence my original remark some time ago that when you did
>just that you found that most institutions could not
>statistically be separated, and I suspect also for TVAA
>that very many cannot statistically be separated from a
>National norm, whether they are actually above or below

If we can get away from the superpopulation for a moment, we can begin to analyze what drives the standard error. It certainly isn't sampling error; nonetheless, it is a quantity that exists in a real sense. As you've implied above, SEs have something to do with model fit. Thus, we should be interested in those things that cause models to fit more loosely to the data. District size is certainly one factor; but correlation of effects within the model will also inflate SEs. Effects like teachers within schools, teachers with school, schools with district might be some examples. As Gene implies, separating these effects may take some doing.

From: Greg Camilli

Rick G.:
>  Greg says that his main concern is "how testing programs shape
>school culture and curricula."  Greg, cannot school culture and curricula
>also shape testing programs?  And couldn't the dominant cause and effect
>relationship flow in that direction, more than in the direction of the
>tests driving curriculum and instruction?  If not, why not?

For a number of reasons, testing programs usually affect schools, and not the reverse. First, testing programs are *intended* to shape curricula. Second, large-scale tests are mainly written to assess only skills that cut across curricula, perhaps with the message "what should be taught." Thus, the individual character of schools and districts can get lost. Maybe you mean "But isn't it possible that they affect testing program?" to which I would answer maybe it happens somewhere, in some superpopulation, but not where I come from. Perhaps in an ideal world.

Maybe you ask "Wouldn't it be possible to change testing programs so that they address the needs of schools" to which I would reply "Yes." But I hasten to add that there is a distinction between local and global testing. I'm sure a lot of really good district assessment programs are in place. If you imply that far from being agents of change, global testing programs may themselves need to be reformed, then I would also agree.

From: Rick Garlikov
Subject: Response from Bill Sanders, TVAAS

From TVAAS and Bill Sanders. Regarding TVAAS:
Several recent queries have dealt with the model(s) used in TVAAS. Anyone interested in learning specifically how the model--in particular, the teacher model--is defined and how the estimates are obtained, there is an explicit definition in the paper cited in an earlier post, "The Tennessee Value-Added Assessment System (TVAAS): Mixed-Model Methodology in Educational Assessment" (Sanders and Horn, _Journal of Personnel Evaluation in Education_ 8:299-311, 1994) on pages 305-309. For those of you interested in how standard errors are obtained, property 2(c) on page 306 of the same article details that information. If you are unable to obtain a copy of this article, you may write to the UT Value-Added Research and Assessment Center, P.O.Box 1071, Knoxville, TN 37901-1071.

To Gene Glass: I hope that reading this article will give you a more accurate picture of what we are doing. After you have looked at it, you may want to restate your questions.

To Leslie McLean: Your plots of standard errors as calculated make no sense. Middle schools in the example school system we provided have more students than intermediate schools in almost every case. Thus, their standard errors tend to be lower. Middle schools also have smaller expected nominal gains. Therefore, your attempt to show a relationship over grades is nonsense.

To Harvey Goldstein: Thank God for you and your insights. I am glad to learn that you in the UK are obtaining comparable sensitivities using similar approaches. As evidenced by the estimate mean gain and the relatively small standard errors, clearly many schools can be distinguished as deviations from the average school within a district. Even though this is not the objective of TVAAS, it does show relative sensitivity. To those of you worrying about conclusions reached from a specific type of test, let me share with you recent findings (as yet unpublished) resulting from the merging of data from assessment instruments other than TCAP into the master database at UTVARAC. We have recently completed merging the 10th grade PLAN (previously known as the PACT) and the 12th grade ACT scores for the last three years and the Tennessee Writing Assessment data into the master database. This database is now comprised of more than three million records. What we have found is that the differences among school systems in the scores of 10th and 12th graders are huge, even after holding 8th grade achievement level constant. Further, the findings are relatively consistant regardless of the test data used. Additionally, writing Assessment data is being analyzed in conjunction with the more traditional forms of testing to evaluate how much unique information is available from the writing assessment. This work will continue, and we will be writing and reporting more of it in the next several months. To those of you who are concerned about the understanding of the TVAAS reports by Tennessee educators, let me share observations based on scores of phone calls and numerous presentations across our state. Educators' understanding and attitudes vary widely. In those systems in which superintendents, supervisors of instruction, and/or principals have worked to learn and share the diagnostic value of the analyses, positive atitudes and progressive plans are leading to improved academic growth opportunities for thousands of Tennessee youngsters.

As educators strive to improve gains, they have identified the following practices as some of the primary impediments to be overcome: 1) excessive re-teaching; 2) failure to communicate over grades; 3) the enormous effects of building change (see Sanders, W. L., et al.(1994). Effects of Building Change on Indicators of Student Academic Growth, Evaluation Perspectives_. 4.3-7); 4) "lock-stepping" instruction to the detriment of many high achieving students, etc. etc.

Finally, for those of you who have complained about a timely response to technical matters relating to TVAAS, if you could have forwarded a magic cure for flu to me two weeks ago, then my responses would have been far more punctual.

As I stated in my original response to Dorn, we will respond to legitimate criticism from all of you to the best of our ability. That still stands. However, I do not feel it is necessary for me to rewrite and post to this forum those things which have been previously published and rigorously reviewed. We will furnish citations, instead.

From: Greg Camilli

Les,
I think your distinction between SE and S&E is a clear and elegant statement. It is a must-read for anyone interested in how statistical models are likely to behave in policy contexts. I'd like to throw in two additional cents:

1. Because some statistical models are complex, and understood by few, it is ironic that this initially evokes more (rather than less) credibility. The downside (or upside depending on how you look at it) is that when a small crack in the model's facade appears, the public and policy makers can be very unforgiving. A relatively small equating anomaly in a New Jersey state test nearly caused the demise of the testing program. Moreover, when such a crack can be patched, it creates an atmosphere in which technical personnel are *less* motivated to diagnose future problems.

I think TVAAS is certain to encounter a related problem with its "linear metric." How is it, the press may ask, that gains are so much larger in the earlier than the later grades? Does this mean that students aren't learning very much in high school? Moreover, because the standard errors are likely to be different across districts, larger districts might have to achieve smaller gains to be consistent with the law. Does this imply different standards for different districts? (I recognize that larger districts have to pull up more kids to achieve a SE's worth of gain -- but I'm not sure this type of argument would wash since a SE may be only a baby step toward the national average.)

2. The "natural" sample that exists on any given day does, I suppose, give rise to a superpopulation of the sort that Harvey Goldstein writes of. However, this is not the population about which most people think of when evaluating gains since, as Bill Hunter points out, it is not a random sample from the school's student body.

From: Bill Hunter

Per Greg Camilli:

> 2. The "natural" sample that exists on any given day does, I suppose,
>    gives rise to a superpopulation of the sort that Harvey Goldstein
>    writes of. However, this is not the population about which most
>    people think of when evaluating gains since, as Bill Hunter points
>    out, it is not a random sample from the school's student body.

I need to clarify a bit. I think it is not the case that a sample of convenience "gives rise to" or "implies" a population of any sort (unless one chooses to regard the sample _as_ a population). As far as I can tell this thinking is exactly backwards--samples derive their meaning and existence from populations: I cannot see that the reverse order has any meaning at all. I also question the utility of Harvey G.'s conception of such samples as samples from a population in time. This _might_ make sense in a time/space of great stability, but I see little reason to believe that children four or five years from now will have experiences of the world (especially the world of information) that is comparable to children of today (or five years past). The kinds of changes that required revision and re-norming of intelligence tests every 15 or 20 years half a century ago now take place in five years or less--probably about the same time scale that would be required to conscientiously develop and renorm the test.

Moreover, I think it is not just that such a sample is not a random sample from some _specific_ population (as Greg suggests above), but that it is not a random sample of ANY population for two reasons: 1) the process of selection did not insure equal and independent likelihood of selection for all members of the population and, more importantly, 2) no population was specified (to which the above process was not applied) (Sorry about the double negative. I'll have to stop watching Seinfeld.)

From: Leslie McLean
Subject: Error plots: clarification

On January 18, Bill Sanders wrote (via Rick Garlikov--and along with many other topics):

>To Leslie McLean:  Your plots of standard errors as calculated make
>no sense.  Middle schools in the example school system we
>provided have more students than intermediate schools in almost
>every case.  Thus, their standard errors tend to be lower.  Middle
>schools also have smaller expected nominal gains.  Therefore, your
>attempt to show a relationship over grades is nonsense.

It was indeed the point I was making--that the plot (or correlation) over grades made no sense. That is why I argued that the within-grade correlations were the ones to look at--and that they were around 0.0. BTW, if means in a table are based on widely different Ns, you would do your readers a good turn to say so, don't you think? Your remark that "middle schools also have smaller expected nominal gains" is ambiguous and interesting. In what sense "expected"; in what sense "nominal"?

From: Sherman Dorn

>The Tennessee's 1992 Educational Improvement Act lists several
>references pertaining to the technical aspects of TVAAS's "mixed
>model methodology" for value-added learning assessment.  In addition
>to the citation coauthored by Professor Sanders (posted 1/18/95) are
>the following:

According to both Sanders & Horn (1994) and MacLean et al. (1991), the Henderson article is the root of the mixed model Sanders and MacLean developed. (I also think Sanders mentioned this in a previous post.) I would hope that someone else, preferably with some more appropriate statistical background than I, would take the time to read the Sanders & Horn, MacLean et al. (American Statistician, 1991), and perhaps Henderson or the SAS Stat chapter on their PROC MIXED in order to help us here.

Conceptually, I gather the rationale for using a mixed model on gain scores is as follows: using raw scores is unfair to evaluate teachers because of initi capacity of students to perform on tests. Using just gain scores is better, but one needs to view various "effects" as random because of (a) individual students' varying responses to a teacher, and (b) measurement error. According to Sanders and Horn, the mixed model solves the problem of regression to the mean and gain scores through the viewing of effects as random. They also claim that the calculus behind TVAAS is able to use partially-censored student records, eliminating the problems with complete-data analysis when, for example, Joey had the flu on the day his school was conducting the math tests and thus all scores for him would be thrown out.

My statistical concerns about this are less with the model (though I've tried to puzzle through the matrix algebra) than with how it connects with real-world schooling and learning. Gene noted a long time ago his supposition that one would expect low-performing students on tests to have lower gains than students who initially perform highly on the tests. TVAAS folks have vigorously disputed this, claiming that their records indicate low-performing students can expect, on average, equivalent gains to higher-performing students. I am assuming, though I have not asked, that this information is individual student scores, rather than an ecological view through teacher or school effects measured against average initial scale scores.

My concern is the use of gain in SCALE scores, which are derived from a cross-sectional 1989 norming of the test used for TVAAS -- the CTBS. To put it simply, the third grade norming population is not the equivalent of the second grade norm population aged a year. (Not only do age cohorts' experiences differ, but flunking students and migration change the age-grade composition of grades.) Thus, we really don't know whether a scale score of, say, 700 on the third-grade CTBS would be the same thing as if the 1989 second grade had been tested a year later in the third grade. Probably both the mean and the standard deviation would be different, which means that scale scores would undergo some linear transformation. For this reason, as well as Gene's concern, I have suggested to UTVARAC staff that they use previous year scores as an independent variable, with this year's scores as the dependent variable, rather than the gain score approach. According to them, adding a variable and rerunning the entire system would not be very onerous at all. Thus far, UTVARAC staff have not responded to this suggestion.

=========================================================================

Sherman Dorn:
Now, as an historian, I would be remiss if I didn't point out that Tennessee already had high-stakes testing well before 1992 and the Educational Improvement Act. And TVAAS is not, in itself, a test. It is a statistical system for analyzing test scores. My concern is that TVAAS' existence will cement in place any mutual distrust between teachers and policymakers that currently exists, and exacerbate organizational problems that TVAAS cannot respond to. I should note, also, that with Lamar Alexander as governor in the 1980s, Tennessee was at the forefront of the earlier wave of school reform, with higher graduation standards, the career ladder program, and the first wave of high-stakes testing. As with the others, I doubt that TVAAS will facilitate much improvement, in its current form at the center of educational evaluation.
Sherman Dorn
Vanderbilt University

=========================================================================

From: Gene Glass
Subject: Glass Replies to William Sanders about TVAAS

On Wednesday, January 18th, William Sanders wrote the following in response to my repeated requests that he address, among other questions, the matter of how the TVAAS deals with ability differences in students who happen to appear in particular teachers' classes:

"To Gene Glass: I hope that reading this article will give you a more accurate picture of what we are doing. After you have looked at it, you may want to restate your questions." (Sanders)

I don't appreciate being fobbed off in this manner and I am in an even more atrabilious mood for having traipsed to the library to copy the article referred to (Journal of Personnel Evaluation in Education 8:299-311, 1994). It contains no adequate information on the question I asked. Indeed, it contains six superficial pages on teacher evaluation and a four-page statement of a completely unextraordinary mixed effects linear model. (I too, Mr. Sanders, can write mixed effects models--Glass & Hopkins, Statistical Methods in Education & Psychology, 2nd Edition, 1984, pp. 465-474; so don't imagine that I am intimidated when someone flashes four pages of equations in my face instead of being responsive to my inquiry.)

The question is whether the TVAAS system takes adequate account of differences in student, and hence, school class ability (as measured by generally accepted ABILITY tests). An examination of Sanders and Horn's article reveals clearly that it does not; I repeat, it does not. In my opinion, the implementation of the system is unfair to teachers for this reason, and very likely other reasons.

The Sanders and Horn article makes a few references to student ability:

1. An early 1980s application of the method "rendered the following findings: .... 5. Student gains were not related to the ability of achievement levels of the students when they entered the classroom." (p. 300) The documentation for this peculiar finding (which if taken seriously would, of course, imply that bright and dull students make the same achievement gains in a school year--a palpable and self- contradictory absurdity) is an unpublished report (Working Paper No. 199) from the UT College of Business Administration (McLean & Sanders, 1984). (It is an irony in need of clarification that the McLean with whom Sanders has collaborated on some statistics articles is not the McLean (Les) who has here taken serious issue with the TVAAS approach.) I don't consider unpublished working papers as adequate documentation for so extraordinary an assertion. They are unreviewed and unpublished. So reference to the article is no more helpful than is repeating the unsubstantiated assertion made weeks ago here that student ability is unrelated to teacher effects in the TVAAS system. (How, I might ask, does TVAAS imagine that the positive correlation between ability and achievement arises in the world if more able students do not learn at a faster rate than less able ones?)

2. In the section of the paper entitled "Problems of using Student Achievement Data in Educational Assessment," Sanders and Horn write: "Since random assignment of students to teachers is usually not practiced and seldom is possible, simple means of class achievement test scores are seriously biased by many factors other than teacher influences that affect student learning. Travers (1981) listed (1) teachers influences, (2) parental influences, (3) genetic endowment, (4) other school influences, and (5) availability of materials as being some of the most important factors that determine the rate of student learning." (p. 304) The reference to Travers (1981) is to a chapter (referenced by page numbers or title) in the Handbook of Teacher Evaluation edited by Jason Millman, not "Millmen" as Sanders and Horn report.

3. Sanders and Horn go on to write at the bottom of page 304 that "Obviously, any system that will fairly and reliably assess the influences of teachers on student learning must partition teacher effects from these and other factors." The reference to "these and other factors " being to both Travers's list and Bingham et al.'s unremarkable assertion that student test scores are influenced by many things including family characteristics, personal characteristics and the like. They go on: "However, it is a hopeless impossibility for any school system to have all the data for each child in appropriate form to FILTER (emphasis added) all of these confounding influences via traditional statistical analysis." (p. 304)
We learn two things here: 1) that Sanders and Horn will soon tell us that they are not going to employ any measures of these student background characteristics in their system and that they regard their statistical analysis as more than merely "traditional."

4. The next paragraph delivers the news: "Using a different approach, the three studies conducted by Sanders indicate that these influences can be FILTERED (emphasis added again) without having to have direct measures of all of the concomitant variables. By focusing on measures of academic gain, each student serves as his or her own 'control'--or in other words, each child can be thought of as a 'blocking factor' that enables the estimation of school system, school, and teacher effects on the academic gain with the need for few, if any, of the exogenous variables." (305) This is an incredible and patently ridiculous claim. It says that "gain scores" gotten by correcting posttest achievement scores via least-squares mixed model estimates from pretest achievement scores can be safely assumed to control for "exogenous" (merely, in this context, "not measured") variables such as social class, race, intelligence, culture and many others. It is said that pretest achievement scores will be the "filter" through which these exogenous influences operate on the posttest scores, hence, the student background characteristics can be ignored. If this is an assumption of the system, it is clearly contestable and nearly certainly false; if it is a belief about what accounts for variation in achievement scores, it is incompetent.

5. They go on: "In an attempt to partition the teacher and school effects from the partial confounding with class ability level, the well-known linear model techniques of analysis of covariance and ordinary multiple regression have been suggested by Millman (1981) and others. The obvious intent was to adjust differences that exist among students to enable a fairer evaluation of teachers. However, if these simple approaches are applied, and even if all of the concomitant data were available, still unanswered is the well-known problem of regression to the mean of the teacher effects that would provide unfair rankings of teachers with varying quantities of student achievement records." (p. 305) Millman (spelled correctly this time) is cited but the bibliography contains only the citation "Millman, J (ed.). (1981). Handbook of Teacher Evaluation. Beverly Hills: SAGE." I presume that Sanders and Horn are referring to a specific chapter in the Handbook that Millman edited, though they don't cite one. I also assume that the chapter in question is in fact Millman's own chapter on using student test scores to evaluate teachers. When Millman and Darling- Hammond edited the Second Handbook on Teacher Evaluation, Jay Millman asked me to write the chapter on Using Student Test Scores to Evaluate Teachers. I will be happy to send an email copy of this chapter to anyone who requests it. As those who know me might guess, I had nothing good to say about the practice. But back to Sanders and Horn--the last sentence quoted above beginning with "However, ..." is simply unintelligible to me. What is clear, however, is that far from lacking "all of the concomitant data," the TVAAS has NONE of the concomitant data.

It is clear in the Sanders and Horn piece that they believe that their mixed model estimation procedure solves problems of unmeasured exogenous variables (ability, social class, race and the like) and provides a fair comparison among teachers: "If the problem (of estimating teacher effects) is viewed not as a fixed-effects problem but rather as a mixed-model problem with both fixed and random effects, then much established theory and methodology exist that offer solutions to many of the problems that have been cited as reasons for not doing educational outcome assessment from student achievement data." (p. 305) This is absolute nonsense. If information on intelligence of the class of students, their family life and the like is not measured and included in the model, it does not somehow magically appear compliments of solving the normal equations. The only way that such background influences can be assumed to enter the TVAAS system is through the "filter" of pretest scores; to think that this "filter" is sufficient to correct for the background characteristics is simple fantasy.

It is clear from reading the TVAAS responses to questions posed here over the past few weeks and from reading their latest published exposition of the method that they have no appreciation of the validity question whatsoever. Nor do they appear to have studied and taken into account the psychometric literature on these problems, which is abundant and well-known: Harris, C.W.(ed.) (1963) Problems in Measuring Change. Univ of Wisconsin Press; Cronbach, L.J and Furby, L. (1970). How do we measure "change"-- or should we? Psychological Bulletin, Vol. 74, 68-80. Cronbach, L.J. (1982) Designing evaluations of educational and social programs. Jossey-Bass. These are a minimum set of references for grappling with these problems.

From: Greg Camilli

Communication does seem to be a problem lately. Rick doesn't understand our differences (or those with Tom and John). I don't understand them, and Rick's latest communication is even more baffling to me. I also don't understand Harvey's position on superpopulations, though doubtlessly generations of statisticians assume this as axiomatic. (As for William Sanders, I wouldn't classify his post as serious communication at all. I will, however, read the articles in good faith.)

In a strange way, Rick and Harvey are saying something similar. Harvey talks about superpopulations; these are entities that don't exist, except in the imagination. Yet it is contended that it is a "reality in the sense that further batches of students are samples from it. How else would you make sense of anything?" A lot of people have sought to answer this question, among them Alan Birnbaum who paraphrased the likelihood principle as the "irrelevance of outcomes not actually observed." He went on the write of the "immediate and radical consequences for the everyday practice as well as the theory of informative inference.' As for the superpopulation, it exists in ones mind as a vehicle for generalization. But generalization itself requires more worldly knowledge. For example, consider the standard error of statistic calculated from a poll during an election. You might say a population exists, but only for a limited amount of time. Experience with the rate of change in public sentiment (and the way the question is asked) is required for a valid generalization. Happily, however, we are in full agreement on the role of specification error, as masterfully articulated by Les. (Because William Sanders thanked God for Harvey, one assumes he also agrees on this point. I'm also thankful for Harvey's participation.)

Rick has a world in mind where he suspects that the rest of us presume that "of course teachers will try to psych out the test, and they ought to do that for their own personal protection." Some of us "might even think that this is the best way to get good scores on such tests or that it is natural for many teachers to believe that -- even if they feel it is an improper way to teach. I have been assuming that it is NOT the best way to get test scores, and it is NOT the best way to teach." And further "In any case, it seems to me to be a dishonorable and a very strange for a profession as a group to acquiesce to a bad policy in such a way that they spend more effort coping with it than trying to correct it. You make it sound like the legislature, the media, and the public do not care whether tests are predictable or not. I don't think it is idealistic to think that the public's or media's interest is that unconcerned about this whole thing."

Rick, the educational profession cares that teachers do not teach to tests because it dilutes the content of the curriculum. Most people in testing programs also believe this. It seems to me that you are arguing with someone who thinks tests should be used to dumb down the curriculum. I believe as you believe, but not all you believe. I also think tests can be good or bad, taught to or not taught to, and that some may cheat while others struggle to maintain integrity. And I think that one can label anything as political. It's a term that is often used in lieu of a sensible explanation of how something has come to pass, and what can be done about it. Most people think this way. (There, I've done it, I've created a superpopulation.)

When I read your messages, it strikes me that you are reflecting on how people are thinking and the language they are using, rather than the content of the messages. The blanket characterization about educators is out of line, in my opinion. I think we are trying to expose and correct bad policies, that's what this discussion is about, and it is not without precedent. In 1988, John Cannell published an article "Nationally Normed Elementary Achievement Testing in America's Public Schools: How All 50 States Are Above the National Average." Mr. Cannell is a doctor (MD) not a measurement specialist. This article (whose gist is given in the title) stirred widespread attention inside and outside the academic community. The effect he noted is now widely called the "Lake Woebegone" effect. Anyone familiar with this article, knows the public is interested in good testing practice. The problem has been researched extensively, in both universities and testing companies. The results are clear: testing programs, whether the test is good or bad, can have unintentionally harmful effects.

Perhaps the communication problem results from a "type" mismatch of our arguments based on experience with your arguments based on a more technical sorting of language and logical form. Finally, I'm not bemoaning TVAAS. I don't yet know how well the program works, but I do ask for information and I am skeptical. Moreover, the TVAAS staff can probably learn more from our skepticism than warm congratulations on a job well done. They too are involved in a sorting language (say test scores) and logical form (say statistical model) and could benefit from our experience.

From: Michael Scriven Scriven@AOL.COM

Aha! At last the battle has been joined. I look forward to a response by Bill S. However, I do hope that both parties, or at least the spectators, can keep in mind the need to go forward with the basic question: is there a feasible method to get at least an approximate estimate of the extent to which teachers are contributing to student learning?

Whether Bill has got it or not, there's a worthwhile task here-getting the best feasible model. If it's not very good, that isn't important. As long as it's better than failing to bring in outcomes to teacher evaluation, and not too expensive, then it's worth having. Remember that even now no teacher is at risk on the basis of the TVAAS results alone: there has to be other confirmatory data. (Sure, we need to look at whether that gets biased by knowledge of the statistics results, but if not, it can easily be done independently.)

So I hope we can keep our collective eyes on the goal of getting an idea of whether any errors here are correctable to the point where we have a useful device for correcting the appalling alternative of judging teaching without reference to outcomes.

Michael Scriven

From: Greg Camilli
Subject: CTB Scale scores

I thought that some of you might want to take a look at some statistics regarding the metric of the scores that TVAAS uses. Below, I've given the mean, median and standard deviation of the IRT metric for fall reading comprehension as reported in the CTBS/4 Technical Bulletin 1 (1989).(I hope this isn't too far out of date.)

Grade    mean    median     STD
 
1        473     481        84.3
2        593     606        81.1
3        652     657        59.6
4        685     694        53.6
5        707     714        48.6
6        725     730        43.8
7        733     738        43.6
8        745     750        43.1
9        760     764        38.6
10       770     774        39.6
11       776     780        38.2
12       780     782        38.0

If you plot these data by grade, some interesting possibilities emerge. For example, one wonders why students below average gain as much as students above average. The explanation I see is that there is much less room for growth at higher grade levels, but this is a function of the scoring metric. A transformation of scale might lead to different results.

From: Sherman Dorn

Michael Scriven writes:

>Whether Bill has got it or not, there's a worthwhile task here-getting the
>best feasible model. If it's not very good, that isn't important. As long as
>it's better than failing to bring in outcomes to teacher evaluation, and not
>too expensive, then it's worth having.

This assumes that (a) there is no alternative to TVAAS to bring outcomes into teacher evaluation, and (b) those of us who criticize the present place of TVAAS in Tennessee official policy [and the potential for similar statistical systems elsewhere] are therefore against judging teachers by what students learn. Neither is true. (Counterexample: go read the issue of Exceptional Children, vol 52 [1986], on formative evaluation.)

Also, the state of Tennessee has spent millions over the past ten years, including the money spent to develop TVAAS, in having students complete annual high-stakes tests. I would call that rather expensive.

Furthermore, the state of Tennessee has, through its promotion of high-stakes testing, demonstrated to its teachers the profound mistrust policymakers have of them, further eroding the legitimacy in teachers' eyes of ALL attempts to judge them by students' outcomes. I would call that rather expensive.

>Remember that even now no teacher is
>at risk on the basis of the TVAAS results alone: there has to be other
>confirmatory data. (Sure, we need to look at whether that gets biased by
>knowledge of the statistics results, but if not, it can easily be done
>independently.)

What you've written reads to me as the following: "I don't care if TVAAS is a bad model. It's a model. Besides, it's a model that doesn't really matter. So let's go on working with it." This is beginning to sound like the kettle defense. ("I never touched the kettle; I only borrowed it; it was broken when I first took it.") I hope that's not what you meant.

From: Leslie McLean
Subject: Validity, again (no apology)

As I have come to understand the TVAAS, it goes like this:

These numbers are measures of gain in student achievement over a school year, and our PROCEDURE entitles you to read them as measures of teacher competence--the larger they are, the better the teacher, and v.v.

Many of us, based on lots of experience, are extremely skeptical of this claim, and so far the explanations have only increased our skepticism. The procedure is SO questionable (from test content to scaling method to model-fitting and adjustment) that any interpretation such as I have given above should be withheld pending supportive evidence--talk about "meaning and values in measurement and evaluation"!

As many posts have argued, the TVAAS numbers become the only evidence to get attention, not because the TVAAS folk say so but because the sheer size and bulk and prestige of the enterprise drive out other measures. Messick ended his speech with a quotation from one Liam Hudson (1972), from his book, The Cult of the Fact (p. 125). Hudson argued that social science:

     ...should be pictured not as a society of good men and true,
     harbouring the occasional malefactor, but rather, as one in
     which everyone is searching for sense; in which differences
     are largely those of temperament, tradition, allegiance and
     style; and in which transgression consists not so much in a
     clean break with professional ethics, as in an unusually high-
     handed, extreme or self-deceptive attempt to promote one
     particular view of reality at the expense of all others.

More strenuous efforts would appear to be required to avoid such a transgression in Tennessee.

From: Leslie McLean
Subject: More on validity (with apologies)

All the posts about statistics, including Gene Glass's response to Bill Sanders, may have left some of you wondering whether it all matters, or what it means. Michael Scriven reminds us that "outcomes" (by which he means what students learn) must not be left out of the data when the competence of teachers are assessed. He gives us a version of what I heard Patrick Suppes say many years ago, "Anything worth doing is worth doing badly". Suppes (a mathematical logician by trade) was quite serious, but his context was the very early days of computer-assisted instruction (remember CAI?). The context today is VERY different--high-stakes testing programs with real consequences for teachers and school officials.

Peace, Michael, but I do not think we can advocate any measure of teacher effectiveness that has not been tested and validated against several different ways of judging teacher competence. Patrick Suppes was working at the leading edge of research, with no serious consequences for anyone if he was wrong. He made many contributions, and quite a bit of money, and no one should be critical of his entrepreneurial spirit and his creativity. Neither should we confuse the context of the late 1960s with the legislated accountability of the middle 90s. Oh, I know, you work with Dan Stufflebeam all the time and you know all about the middle 90s. Your post does not reflect this knowledge. (As you will see, I do not believe that we yet have a measure, or set of measures, valid enough to include in the evaluation of teacher competence.)

So what about all this statistics talk? Gene Glass has indicated his reservations about TVAAS's "mixed models" (perhaps I understate the case). What ALL the regulars on Edpol should consider (those that have not already done so) is that statistics has moved on well beyond "mixed models", in response to the complexity of social groupings--ESPECIALLY students within classes within schools within districts within states (within countries!). Since Gene has listed his excellent book (with Hopkins), let me cite my own contribution to the post-mixed-model literature, an application of multilevel models (with the essential contribution of Harvey Goldstein): McLean, et al., (1988) The reliability of the oral examination in internal medicine. Journal of the Royal College of Physicians and Surgeons of Canada, ... ...). The study is in education (of medical specialists) and is in the mainstream of generalizability theory.

There have been two main channels of development and two major tributaries to better understanding of school achievement:

Murray Aitken and Nick Longford's early work, parallel to Harvey Goldstein's (e.g., 1986, J. Royal Stat. Soc. B, 149: 1-43). Tributary.
Harvey's book, the first, Multilevel Models in Educational and Social Research. New York: Oxford U. Press, 1987. Harvey's group at the Univ. of London Institute of Education offers computer software (ML2 and ML3) and training. Mainstream.
Bryk and Raudenbush, at about the same time: Bryk et al. (1986) An introduction to HLM: Computer program and user guide. University of Chicago. (HLM: Hierarchical Linear Models--see previous Edpol posts). B&R and Co. offer computer software for fitting two-level models and do extensive training. Mainstream.
Nick Longford and Murray Aitken's later work, extending their earlier models and offering computer programs. Tributary.

So what? We have been led to believe that the TVAAS utilise the latest developments in statistical models in order to produce estimates of gain that truly reflect what students are learning. Yet the published sources cited fail to support this claim--in fact the sources fall quite far short. Either Dr. Sanders and his associates are hiding their light under a bushel or their claims are not supported--these seem to me the only alternatives. Throughout the interchange on Edpolyan (I hesitate to call it a discussion, given the reticence of the TVAAS folk), a nagging suspicion has grown among those of us who struggle with test scores and models thereof that the TVAAS structure is a house of cards. Dr. Sanders will no doubt regard this as another bit of "nonsense" from me, but his contributions have yet to add to the "sense". Given the salience of the system in Tennessee, we all hope to have the sense revealed to us.

From: Scriven@AOL.COM

Of course, anything worth doing should be done as well as we can, and I'm counting on you, Gene, and Bill S to get the best measure we can. But it seems to me you're talking unrealistic standards, Les, when you say that we don't yet "have a measure... valid enough to include in the evaluation of teacher competence." One must look at the validity of the measures we currently use in order to see whether TVAAS is "valid enough".

Short of finding a teacher in flagrante delicto with a student, what we use is a sorry bunch of variables, ranging from the process variables observed on a visit to the classroom, some of them vaguely correlated with learning gains via process/outcome research (none of these are legitimate), through reports from parents, noise heard through the classroom walls, all the way to evidence of enrolment in post-grad studies.

I think TVAAS still looks like a better addition to that pile than most of the stuff in it at the moment.

In any case, I'm suggesting that we should judge it in terms of whether it enables us to do better, not just point out that it doesn't do as well as would be desirable.

From: Scriven@AOL.COM
Subject: Re: Dorn on Scriven

No, Sherman, what I'm saying isn't much like your version of it. If someone has a better student-outcome measure of teacher merit than TVAAS, let's show it's better, not just point out imperfections in TVAAS. I know of many allegedly better efforts, but none that hold up under the kind of heavy fire that TVAAS is getting here. If you have a better one in mind, try explaining it here and let's see. In any case, whether you do or not, the standard for criticism of it and TVAAS has to be whether it's better than nothing in the outcome dimension, not whether it's flawless.

As to your claims that (i) I'm assuming that all critics of TVAAS are "against judging teachers by what students learn" and that (ii) in pointing out that there's a safety net for teacher evaluation against errors in the TVAAS model I'm saying the TVAAS is of no value, your logic in imputing them to me seems pretty far-fetched. In any case, I did not and do not intend any such implications.

I'm trying to get us to keep reasonable standards in mind, not ideal ones; that doesn't seem to be an effort that deserves the kind of overkill attack you've launched on the suggestion.

By the way, I thought we were trying to avoid being condescending to each other on this board. It seems to be pretty condescending to say to me "Go read X" (and you'll see how wrong your assumptions are), rather than give reasons here and now. I could throw references at you, too, but I try to get the points across with a summary of reasons.

From: Sherman Dorn
Subject: TVAAS

Michael Scriven writes:

>If someone
>has a better student-outcome measure of teacher merit than TVAAS, let's show
>it's better, not just point out imperfections in TVAAS. I know of many
>allegedly better efforts, but none that hold up under the kind of heavy fire
>that TVAAS is getting here. If you have a better one in mind, try explaining
>it here and let's see. In any case, whether you do or not, the standard for
>criticism of it and TVAAS has to be whether it's better than nothing in the
>outcome dimension, not whether it's flawless.

There are two pieces to this "better than nothing in the outcome dimension." One piece is what you have explicitly described -- trying to include information about student learning in teacher evaluation. Gene's and Les' criticism get at that, and Gene has described at least one alternative.

A second piece, however, deals with the legitimacy of the tool within school cultures. If something (or a set of different somethings) is so bad that it poisons the atmosphere for further attempts to include student outcomes in teacher evaluation, then, yes, I'd argue that it can be possible to be much worse than nothing in the outcome dimension. It is very difficult from the outset to get teachers to pay attention to all their students, and the umpteenth attempt at accountability will just be treated as crying wolf.

The basics of formative evaluation, or curriculum-based assessment, as described in special education journals and as I've described a few months ago, is for a teacher to test students frequently and make instructional decisions based on whether an individual student is meeting a pre-specified goal. Yes, this depends on competent teaching, appropriate selection of goals (from what I gather, many teachers are timid about goals), and tests that are decent and relatively easy to conduct and score appropriately.

But there are several advantages: the information on the student shows frequently-gathered information in a situation that, after a few times, should be immune to practice effects, can be adjusted to be sensitive to student progress for very low-performing children, is much more likely to be seen as legitimate information by teachers than annual test scores, and can be used for evaluation of the teacher (who should be selecting appropriate goals, assessing appropriately, and making decisions in response to student progress or lack thereof). This type of assessment for students was developed in a context of individualized programming (for special education students), but there are also forms appropriate to whole class assessment, and you could make parallel decision rules -- create tests that ask students to perform tasks related to curriculum for the entire year, assess frequently, and respond instructionally to what you see. (And, at the teacher evaluation level, see if the teacher is responding to the information.)

>As to your claims... I did not and do not intend any
>such implications.

I will certainly accept that, and mea culpa if it read as too personal a comment. It did, however, seem rather strange to be arguing that poor methods are better than nothing in a policy context. Maybe I could accept that if TVAAS were used as evaluation in a single or a few school systems. But this is something that affects several hundred thousand children, and thousands of teachers.

In response to Les McLean, Michael Scriven writes:

>Short of finding a teacher in flagrante delicto with a student, what we use
>is a sorry bunch of variables, ranging from the process variables observed on
>a visit to the classroom, some of them vaguely correlated with learning gains
>via process/outcome research (none of these are legitimate), through reports
>from parents, noise heard through the classroom walls, all the way to
>evidence of enrolment in post-grad studies.

As an historian, I see the causal relationship here in a very different way. Yes, teachers are not evaluated on what children learn, but that's not primarily because we have bad outcomes measures. We have bad outcomes measures because schools are not designed to pay attention to outcomes. Even when standardized testing spread in the 1980s, it was frequently not designed to assess teachers but rather punish students for not learning (e.g., the proficiency test I had to pass in California in order to graduate). Daniel Calhoun writes, in THE INTELLIGENCE OF A PEOPLE, of the ways in which our views of intelligence, and ways of judging what children have learned, have typically been a way of blaming children (or adolescents or adults) for not learning. List subscribers here in the past week have volunteered anecdotes of how tests supposedly designed to help children by making teachers accountable instead have created pressures which (I believe) have no business in a school. I don't think it's because people have not yet come up with the perfect statistical tools to turn those test results into good outcomes measures. I think it's because schools have a tendency to sort and blame when the heat's on. With this tendency, I'd rather not rely on something like the TVAAS to get teachers to pay attention to students.

From: Rick Garlikov
Subject: Re: Validity, again (no apology)

Les, TVAAS has given at least one non-mathematical argument to demonstrate the reasonableness of their approach; that is that their numbers match the evaluations supervisors make. And although Harvey doesn't see any purpose or point in my previous request about this matter, it seems to me that one of the implications of TVAAS's claims is that they could quite accurately predict the test scores of students, given sufficient information about that student's past performance relative to his/her classmates, and given the information how those classmates have done on a test the student in question does not take, or has not yet taken. I suggested a trial in which TVAAS makes such predictions based on tests that are taken but withheld from them until after the predictions. Harvey thinks this is nonsense. But it seems to me to be a way of demonstrating whether TVAAS has some sort of statistical power or not, without arguing about the mathematics. As I said in my first post about all this, in science math is only a guide, not a proof of how the real world behaves. Even in physics, they have to do the experiment to try to confirm, not that the math is right, but that it is the right math. I think Gene would agree with this in those cases where he thinks math is applicable at all. And, like Gene, I would agree that there are far more important aspects to evaluating teachers than what can be described or computed mathematically. But TVAAS agrees with that. Their focus is only about one aspect of teaching. They are, to use my baseball analogy, only computing batting averages; they are not claiming batting averages are the only measure of the value of a baseball player. And they are using every means at their disposal to try to make that point abundantly clear.
But with regard to the math, I think a non-math means of demonstrating the reasonableness of the method is perhaps more important than mathematical arguments about the math itself. Especially for those, like me, who have no way of following the arguments Gene, Harvey, Greg, Les, et al, can make, but which don't get to the heart of whether even the most impeccable math applies or not.

From: Sandra P Horn
Subject: Response to Glass from Wm. Sanders

Response to Glass's Jan. 19 post from William Sanders:

Sanders: "To Gene Glass: I hope that reading this article will give you a more accurate picture of what we are doing. After you have looked at it, you may want to restate your questions."

Glass: "I don't appreciate being fobbed off in this manner and I am in an even more atrabilious mood for having traipsed to the library to copy the article referred to (Journal of Personnel Evaluation in Education 8:299-311, 1994)."

Sanders: There was certainly no intent on my part to fob you or anyone else. What has been disturbing to me (and still is), is how you or anyone else could assume to know what we are doing in TVAAS without examining relevant materials and then write criticisms based upon your own assumptions in such declarative tones. Now that at least you have looked at the paper to which we referred, we can begin to discuss your criticisms from the perspective of the model(s) which we are using.

Glass: "It is said that pretest achievement scores will be the "filter" through which these exogenous influences operate on the posttest scores, hence, the student background characteristics can be ignored. If this is an assumption of the system, it is clearly contestable and nearly certainly false; if it is a belief about what accounts for variation in achievement scores, it is incompetent."

Glass: "The only way that such background influences can be assumed to enter the TVAAS system is through the "filter" of pretest scores; to think that this "filter" is sufficient to correct for the background characteristics is simple fantasy."

Sanders: Observe that we fit the model (including all teachers over all subjects over all grades simultaneously within each school system) to the entire observational vector available for each student with the appropriate variance-covariance structure within the r-matrix. By so doing, one can view each student vector to be "like" an incomplete block with the analogy to the analysis of incomplete block designs. After obtaining teacher and school effects this way, what justifies our claim that most of the socio-economic confoundings have been filtered?

We do ex post facto analyses relating the teacher or school effects to variables that have been accepted by some (at least) to proxy socio economic status. We have found the following based upon the state-wide analysis. There is no relationship between the school effects and: (1) the percentage of students receiving free and reduced lunches in the school; (2) the racial composition of the student body; (3) the location of the building as to urban, suburban and rural.

Also, the relationship between the school effects and the school means are extremely low as can be seen in the following table. {No we do not have state-wide a measure of student ability. However, in our early studies we found that the inclusion contributed virtually nothing}.

Simple coefficients of correlation between school effects (math) and the mean for each school (3 year averages).

      Grade       r     N (no. of schools)
        3       .056        898
        4       .081        888
        5       .101        865
        6       .163        673
        7       .105        509
        8       .078        510

With relationships of this magnitude, it certainly is easy to show numerous examples of schools across the entire spectrum of mean achievement scores who have obtained excellent gains.

The statements that we have made are not mere assumptions but rather come from the results of the data analysis!! However, if needed, the modeling process which we deploy does not in any way preclude the use of other covariables. Let me restate that, to date, we have not found the need for these additional variables. If, in the future, using other test data, a need is found to insure fairness, then the models will be expanded to include them.

Glass says, "How, I might ask, does TVAAS imagine that the positive correlation between ability and achievement arises in the world if more able students do not learn at a faster rate than less able ones?"

One of the major findings from the state-wide analysis of the data is that most (not all) of our school systems do not have the curricular stretch to allow the highest achieving students to express gain commensurate with the gain obtained by the average and below average students. However, in other systems, we find this to be not true! In those systems, students of all achievement levels are making satisfactory gains.

Glass: It is clear from reading the TVAAS responses to questions posed here over the past few weeks and from reading their latest published exposition of the method that they have no appreciation of the validity question whatsoever.

That is a heavy charge. Let me share several additional findings.

1. As part of the second study (data from Blount County), the principals were asked to forecast whether each teacher would profile in the top, mid or bottom third of the Blount County distribution. The principals forecasted about 90% of the bottom profiling teachers, and could distinguish between the top and mid groups of math teachers; however, they could not distinguish between the top and mid reading and language arts teachers.

2. As was mentioned in the most recent post, we now have merged the writing assessment data into the master data base. Even though the studies are not complete, it appears that we could substitute this data for the language arts data from TCAP without appreciably changing the rankings of schools within a system.

3. We have data from the 10th grade TCAP and from the 10th grade PLAN tests. The data could be interchanged with virtually no change in the rankings of systems and schools.

4. When we have had direct knowledge of change in educational practice, then we have observed change in the effects. For instance, Knox County, which has a middle school system, has had severe retardation in the gains for 6th grade (the first year of the middle school). This past year a major effort was launched to improve communication between feeder schools and receiving schools such that instruction could be provided earlier in the school year commensurate with where the feeder schools had left off the previous year. After this effort, the Knox County 6th grade gains improved appreciably.

6. One of the administrators at Tennessee State University (TSU) gave us list of several schools in which they had assisted in developing a "hands on approach" to teaching science. In all of these schools, we found that the cumulative gains (gains summed over grades) were greater than 100% cumulative norm gain.

7. For those schools that are offering pre-algebra in the seventh grades, their gains are considerably higher than those which are not.

Does any of this speak to validity?

Cost

Some have been addressing this question. Let me mention a cost that needs to have the most consideration. That is the cost of denying human beings a chance to be competitive in later life because they were unfortunate enough to have attended schools and systems that were not offering a competitive academic program.

To me the purpose of all formal education is to provide as many human beings as possible, as many choices in life as possible. As I mentioned in the previous post, we found that tremendous difference exists among our school systems in mean ACT scores, when considering students that were at the same place academically as measured with the 8th grade TCAP tests. For example, considering only the top quartile of eighth graders, the top two or three systems will have mean ACT math scores of 27 while other system means will be 18. Why? Because most of these low averaging systems do not offer accelerated math programs, no AP courses, etc., etc. TVAAS is not just about profiling teachers, but rather is an attempt to measure those influences that are either accelerating or impeding the academic progress of populations of students.

Many can quibble about our models, can quibble about the tests and testing, can argue that the taking of the blood sample will do more harm to the patient than the illness. However, any reasonable analysis of the totality of the data will confirm that this process is fair and indeed has begun to bring pressure on school officials to improve the educational process.

No system, school or teacher is being asked to do more in Tennessee than is presently being done by educators working under similar circumstances. The extreme variation in effectiveness that exists among systems, schools and teachers is THE problem in public education in Tennessee. Attempts to attribute these findings solely to the testing or the statistical methodolgy will require extreme contortions of logic to evade the mountains of evidence to the contrary.

William L. Sanders
Director and Professor
University of Tennessee Value-Added Research and Assessment Center

From: Sandra P Horn
Subject: Re: Les McLean on TVAAS

There have been several messages similar to this one that originated from Les McLean. I would like to respond to it as a member of the TVAAS team, but not as a spokesperson for the team or for the UT Value-Added Research and Assessment Center.

On Sat, 21 Jan 1995, Leslie McLean wrote:

> Gene Glass's hypothetical procedure for local use of test scores in teacher
> evaluation ended with a modest claim--for a modicum of validity.  The
> post arrived just as I finished re-reading an ancient manuscript--Samuel
> Messick's presidential address to Div. 5 of APA in August 1974, "The
> Standard Problem: Meaning and Values in Measurement and Evaluation".  My
> it has stood up well! It contained reference to an even more ancient
> scroll, Lee Cronbach's writing on validity (with Meehl), going back to 1955
> (and thus more than 40 years's old).  The phrase that deserves emphasis
> here in 1995 is that we do not validate tests, but "AN INTERPRETATION OF
> DATA ARISING FROM A SPECIFIED PROCEDURE".  (That's from Cronbach's
> chapter in Ed. Measurement, edited by R.L. Thorndike, 1971, p. 447).
>
>    As I have come to understand the TVAAS, it goes like this:
>
>    These numbers are measures of gain in student achievement over a
> school year, and our PROCEDURE entitles you to read them as measures of
> teacher competence--the larger they are, the better the teacher, and v.v.
>

Then you have not been reading the posts we have sent you, or you are selectively ignoring whole sections. TVAAS uses student data longitudinally over at least three years. So far, reports have only been issued for schools and systems. Teacher reports will be issued for the first time this year. Teachers are only asked to attain normal gains for their students. Standard errors are employed so that a range of scores fall within the definition of "normal gain." We are working closely with teacher and administrator representatives to insure the privacy of individual teachers and to develop meaningful, useful reports for them.

>    Many of us, based on lots of experience, are extremely skeptical of
> this claim, and so far the explanations have only increased our
> skepticism.  The procedure is SO questionable (from test content to
> scaling method to model-fitting and adjustment) that any interpretation
> such as I have given above should be withheld pending supportive
> evidence--talk about "meaning and values in measurement and evaluation"!

We are attempting a dialogue with those of you interested in TVAAS. We didn't realize there were so "many" of you, since most of the posts have come from a handful of people. If you have questions, we continue to attempt to address them. What is it you find questionable, specifically? Asserting that the model is questionable certainly should be withheld pending supportive evidence. Talk about "meaning and values in responsible discourse"!

>   As many posts have argued, the TVAAS numbers become the only evidence
> to get attention, not because the TVAAS folk say so but because the sheer
> size and bulk and prestige of the enterprise drive out other measures.

These "many posts" have originated from a very few individuals who may not be representative voices. TVAAS has not by any means driven out other forms of assessment. The Tennessee Writing Assessment has just recently come on line. Performance evaluation of teachers continues, apace, and is the primary means of teacher evaluation in Tennessee. As we have stated repeatedly, TVAAS currently assesses the effects of educational practice on the outcomes students portray in five subject areas in grades 3 through 8, hardly the bulk of education in Tennessee.

> Messick ended his speech with a quotation from one Liam Hudson (1972),
> from his book, The Cult of the Fact (p. 125).  Hudson argued that social
> science:
>      ...should be pictured not as a society of good men and true,
>      harbouring the occasional malefactor, but rather, as one in
>      which everyone is searching for sense; in which differences
>      are largely those of temperament, tradition, allegiance and
>      style; and in which transgression consists not so much in a
>      clean break with professional ethics, as in an unusually high-
>      handed, extreme or self-deceptive attempt to promote one
>      particular view of reality at the expense of all others.
>
>
> More strenuous efforts would appear to be required to avoid such a
> transgression in Tennessee.

This is the part I consider insulting. Categorically, we have never broken with professional ethics. We communicate with educators on every level, within our state and across America, Edpolyan being only one example, seeking input, reaction, responsible criticism, and an understanding of what educators need in order to most effectively use TVAAS data.

TVAAS was created for the betterment of education. That is its purpose. There are no ulterior motives, and no one here stands to profit from some demonic plot to crucify teachers. Nor are we self-deceiving. We are fully cognizant of how achievement data have been misused by media and others in the past. TVAAS is an attempt to change that. Could it be that perhaps you are stuck in the old paradigm, that because this model uses data from standardized tests that you are making assumptions based on past usage rather than openly examining this new way of analyzing and reporting outcomes?

We have remained civil when those with whom we seek to communicate sink to such off-handed insults as these and when others have accused us of everything from arrogance to statistical ignorance rather than question their own assumptions or attempt to work collegially toward solutions.

It is not high-handed to refuse to engage in this type of low insinuation. Furthermore, it is not high-handed when our responses do not come fast enough to satisfy your schedule. We have tried to explain why there are delays but, repeatedly, this is interpretted as a personal issue rather than a logistical one. Is there a reason why we must waste time in this manner?

I am not willing to abandon my belief in the ability of people of good will to solve society's problems. After all, that's what brought me to the TVAAS team in the first place. I just find it very discouraging to find so little good will. It never occurred to me that this discussion would devolve into paradigm wars.

We have two discussions going on here: the philosophical aspects of educational assessment and the statistical propriety of the TVAAS model, both of which are vital. I don't foresee much progress being made on either front unless we base the discussions on belief in the good will of our partners in seeking solutions to the problems in education and, in the absence of evidence to the contrary, in the competence of our fellow discussants in their respective fields.

Perhaps this bickering is normal and I just don't know how to play this game. Forgive me. I am new and perhaps naive. But I, for one, find it counterproductive and saddening whenever, in any circumstances, people do not treat each other with respect, especially when egos preclude the open exchange and consideration of ideas.

Sandra Horn

From: Sandra P Horn
Subject: Re: Camilli's q's re: TVAAS

Dear Greg, Sorry it has taken so long to get back to you on these important questions. In regards to the CTBS/4, which is used in the Tennessee Comprehensive Assessment Program (TCAP) tests that provide the scaled data for TVAAS:

1. You asked us some time ago about how the TCAP was scored. It is scored by pattern (IRT) scoring in accordance with CTS's three-parameter statistical model.

2. You also asked what form of the CTBS/4--Benchmark, Battery, or Survey--was used. The full battery is used for the math, language arts and reading tests. Science and social studies are tested with the survey versions.

3. There has been considerable discussion on the issue of "teaching to the test." John Covaleski points out that the degre to which a test is capable of being "psyched out"--by this, we understand that he means "predicted"--by stakeholders has a great deal to do with how valid the results will be. We agree that this is true. The current TCAP tests, both the norm-referenced subtests and the criterion-referenced subtests, are comprised of 70% new items every year, the remaining 30% being drawn from a bank of repeat items from all previous years' tests. CTB certifies that the new items are equivalent to the items they replace in scoring properties. The purpose of requiring new items is specifically to discourage the educationally indefensible--and ineffective--practice of teaching to the test. Those of us involved with TVAAS anticipate that TVAAS will show that students who are fortunate enough to have teachers who develop concepts and encourage the discovery of connections do far better on all forms of assessment than those under teachers who waste instructional time trying to "psych out" the test. If, indeed, this turns out to be the case when the teacher reports are issued later this year, perhaps the preferred strategy for improving test scores will more often center around improving instructional strategies that meet the needs of each individual student, and the teaching to the test strategy will be abandoned in the face of evidence of its ineffectiveness.

We appreciate your continued interest in TVAAS.

Sandra Horn

From: Alan Davis

Rick recently argued that teaching to high stakes tests probably would not have adverse consequences if the tests were good and if teachers "knew that it is good teaching that gives the best chance for improved test scores."

One of the principles of high stakes testing, documented by Linda McNeil in Texas and supported by other research (see Eva Baker at UCLA) is that things that are tested push out curriculum that is not tested. So when Jane Armstrong (Education Commission of the States) and I visited school districts in Virginia in a study of elementary science instruction, teachers in schools where standardized test scores were low in Reading and Math informed us that they didn't teach much science at all, because science either wasn't tested or the science scores weren't of muc interest to parents. About the same time, the Michigan Association of Science Teachers pressured the state to start testing students in science as part of the Michigan Educational Assessment Progam (MEAP) so that teachers would begin teaching science -- it had been shoved out by reading and math.

The consequence of this is that everyone wants their subject tested, but to test everything that is important for kids to do in schools becomes a matter of overkill. Most will agree on the central importance of reading and math, but most will also agree that kids should learn other things in school that are less readily measured. In high stakes states, reading and math (and language arts, defined by the tests as editing sentences containing mistakes) push out other instruction.

From: Leslie McLean
Subject: Hommage a Sandra P. Horn

Bravo, Sandra Horn, for your spirited responses on behalf of the TVAAS! No insult was ever intended in my postings, but since insult was felt, please let me apologize.

I was particularly grateful for your clarification that you had not yet reported scores by teacher as gains-attributed-to-teacher-competence, but that you would shortly do so. Some of us (perhaps even you) forget sometimes that this is a listserv discussion and not a debate in the Tennessee legislature. We anticipate events--always a risky venture--but that is the way of the scholar--however bothersome it may be to the people who are on line for actions and justifications every minute. As unlikely as it may seem, Sandra, I have been the person who had to draft responses for the Minister (read: Chief elected official) on occasion. I would like to share with you sometime my reply to the school district official who wrote to the Minister to complain about our "Perception Bag", an unstructured set of curriculum materials on perceptions--including bits of paper soaked in scents--a sick drunk, a new Buick, ...

Please let no one ever read my posts as suggesting that the TVAAS staff are not concerned with the welfare of teachers and students. You have made your case well and I would call attention to it.

I not only read your posts, I save them (most of them)! For example, here is an excerpt:

Sanders:
Observe that we fit the model (including all teachers over all
subjects over all grades simultaneously within each school system)
to the entire observational vector available for each student with
the appropriate variance-covariance structure within the r-matrix.
By so doing, one can view each student vector to be "like" an
incomplete block with the analogy to the analysis of incomplete
block designs.  After obtaining teacher and school effects this
way, what justifies our claim that most of the socio-economic
confoundings have been filtered?
  
     We do ex post facto analyses relating the teacher or school
effects to variables that have been accepted by some (at least) to
proxy socio economic status.  We have found the following based
upon the state-wide analysis.  There is no relationship between the
school effects and: (1) the percentage of students receiving free
and reduced lunches in the school; (2) the racial composition of
the student body; (3) the location of the building as to urban,
suburban and rural.
  
     Also, the relationship between the school effects and the
school means are extremely low as can be seen in the following
table.  {No we do not have state-wide a measure of student ability.
However, in our early studies we found that the inclusion
contributed virtually nothing}.
  
  Simple coefficients of correlation between school effects (math)
and the mean for each school (3 year averages).
 
      Grade       r     N (no. of schools)
        3       .056        898
        4       .081        888
        5       .101        865
        6       .163        673
        7       .105        509
        8       .078        510
 
With relationships of this magnitude, it certainly is easy to show
numerous examples of schools across the entire spectrum of mean
achievement scores who have obtained excellent gains.
 
     The statements that we have made are not mere assumptions but
rather come from the results of the data analysis!!  [end of clip from
TVAAS post]

Thank you for the reports of your data analysis, which support what I noted, that you are concerned about teachers and students. It appears that some of us have not clearly communicated our concerns, since our concerns antedate all the data analysis reported above. We are concerned, and our concerns are not laid to rest, by descriptions such as, "we fit the model simultaneously ...over all grades ... to the entire student vector". How long must a student be in a teacher's class in order to be counted amongst that teacher's "achievements"? (Or do students not move in and out of classes in Tennessee--in many Ontario classrooms, there are more than 50% "in's and outs" in a year. And you fit the model over three years! I agree--the incomplete block design/metaphor is apt, and please forgive some of us if we sometimes lose patience and feel "fobbed off" by simplistic explanations.

Please do accept, Sandra, that those few of us who are trying to stay in communication with you on these very important but technical issues are people who also care about teachers and students--and colleagues who are charged with implementing complex systems. Some of us have likely been at it longer than you have; not that this makes us wiser, but it does give us a fairly wide context. We listen to wise colleagues such as Michael Scriven who remind us that teacher evaluation must somehow include consideration of student achievement. Our context, however, includes some distressing and sceptisism-inducing experiences with systems that begin with IRT scaling of multiple choice items by three-dimensional models and carry on to models of exceptional computional complexity to results said to have simple interpretations. If you could share our experiences, I would hope you would not feel an insult at our questions. Let us keep up the dialogue.

From: Sandra P Horn
Subject: TVAAS and moving students

Dear Les, Thanks for your kind post. It made me feel a lot better. I have always assumed that all of us were participating here with the common aim of hashing out direction and means for insuring the best education possible for our children. It's a point I felt needed to be be made again, though-- that mutual respect and open inquiry are the means by which our aim can be achieved.

As to your question on how moving students affect teacher reports, the answer is that students must be in a teacher's classroom a minimum of 150 days during the school year in order to be represented in that teacher's cohort. So not only are teachers protected from being held accountable for students who enter late, they are also not held accountable for students with excessive absences (students have a 180-day school year).

Please continue to ask.

From: Gene Glass
Subject: Why TVAAS Exists. Yesterday, Sandra Horn wrote:

>   TVAAS was created for the betterment of education.  That is its
> purpose. There are no ulterior motives, and no one here stands to profit
> from some demonic plot to crucify teachers.  Nor are we self-deceiving.

We can certainly credit the motives of Sandra Horn and William Sanders without accepting this somewhat oversimplified explanation of what caused TVAAS to be. Those of us who have studied the politics of similar efforts across the past 25 years have reason to doubt that TVAAS is solely an expression of a popular will to better education--even if it is far short of a profit-making daemonic plot.

An alternative account of the creation of TVAAS holds that it was proposed by traditional foes of public-education funding in the Tennessee Legislature as the compromise required to pass career ladder legislation. The career ladder legislation provided big financial benefits for teachers. "If teachers are going to get big payoffs like that, then they better prove they are adding value to students' lives." Such is the political rhetoric of accountability.

I think it is fair to say that the legislative genesis of TVAAS bore all the markings of suspicion toward teachers and schools and hostility toward increased education expenditures. This doesn't doom it to fail statistically or politically, but it is relevant context for understanding how it will operate, how its strengths and weaknesses will be seen and what its future is.

From: William Robert Saffold
Subject: More TVAAS Questions

I have been "lurking" for some time now, but I have yet to see many of my basic questions about TVAAS answered, or addressed sufficiently. I would love to hear (on-list) from the TVAAS staff or anyone else who can help me understand exactly what is going on with value-added assessment. Following are some observations and questions:

1. Dr. Sanders claims that scores are consistent from year to year, but the variations in gains (or losses) are actually quite large. Following are a few actual district report cards in various subjects. I have chosen small, medium, and large districts just to show the variation in gains. I have included the explanatory material at the end of this section:

___________________________________________________________________
Hollow Rock--Bruceton Special School District--804 students
 
Language--Estimated Means
 
Grade           2       3       4       5       6       7       8
USA Norm     667.0   696.0   707.0   724.0   739.0   749.0   757.0
1991:        681.2   705.7   706.4   729.9   734.5   745.5   757.1
1992:        698.5   701.6   723.4   737.6   757.1   738.0   763.0
1993:        692.0   693.6   708.7   732.0   741.0   753.4   760.4
1994:        673.9   695.2   711.3   725.9   737.3   761.9   763.5
 
Language--Estimated Gains
 
Grade           3       4       5       6       7       8       %CUMGAIN
USA Norm:     29.0    11.0    17.0    15.0    10.0     8.0
1992:         20.4    17.7    31.2    27.2     3.5    17.5      130.6
1993:         -4.9     7.1     8.7     3.4    -3.7    22.3       36.5
1994:          3.2    17.7    17.2     5.3    20.9    10.1       82.6
 
3 Yr Avg       6.2R*  14.1G   19.1G   12.0R    6.9R   16.6G      83.2
Std Error      2.8     2.5     2.1     2.0     2.2     2.3
 
______________________________________________________________________
Campbell County School District--6,410 students
 
Science--Estimated Means
 
Grade           2       3       4       5       6       7       8
USA Norm:    655.0   690.0   709.0   732.0   745.0   756.0   765.0
1991:        663.7   688.5   697.0   708.5   723.9   733.1   752.6
1992:        672.9   694.7   707.5   719.5   724.0   743.2   761.2
1993:        660.5   683.8   709.6   713.5   738.5   740.6   755.8
1994:        673.6   698.0   706.6   723.1   716.4   736.0   757.0
 
Science--Estimated Gains
 
Grade           3       4       5       6       7       8       %CUMGAIN
USA Norm:     35.0    19.0    23.0    13.0    11.0     9.0
1992:         30.9    19.0    22.5    15.5    19.3    28.1      123.0
1993:         10.0    14.9     6.0    19.1    16.6    12.5       72.8
1994:         37.5    22.7    13.5     3.0    -2.6    16.4       82.4
 
3 Yr Avg      26.5R*  18.9Y   14.0R*  12.5Y   11.1G   19.0G      92.7
Std Error      1.5     1.3     1.3     1.2     1.2     1.1
 
____________________________________________________________________
Chattanooga City School District--20,159 students
 
Social Studies--Estimated Means
 
Grade           2       3       4       5       6       7       8
USA Norm:    652.0   691.0   713.0   735.0   745.0   749.0   761.0
1991:        664.7   686.1   717.3   738.9   736.7   737.8   754.2
1992:        669.1   693.9   718.4   743.9   740.8   750.5   756.7
1993:        645.8   690.4   721.4   726.2   729.3   751.5   761.2
1994:        642.8   674.7   705.9   733.6   739.6   747.4   754.0
 
Social Studies--Estimated Gains
 
Grade           3       4       5       6       7       8       %CUMGAIN
USA Norm:     39.0    22.0    22.0    10.0     4.0    12.0
1992:         29.3    32.1    26.5     1.9    13.5    18.9      112.0
1993:         21.0    27.2     7.6   -14.5    10.6    10.5       57.3
1994:         28.5    15.1    12.1    13.2    17.6     2.4       81.6
 
3 Yr Avg      26.3R*  24.8G   15.4R*   0.2R*  13.9G   10.6R*     83.6
Std Error      0.9     0.7     0.6     0.6     0.6     0.6
____________________________________________________________________
Mixed Model Analysis using Scall Scores from Norm-Referenced Section of the
TCAP
 
G=  Green Zone:  Estimated mean gain equal to or greater than national norm
Y=  Yellow Zone:  Gain below national norm by 1 std error or less
R=  Red Zone:  Below norm by more than 1, but no more than 2, std errors
R*=  Ultra-Red:  Below norm by more than 2 std errors
 
NG=  Negative Gain:  no percent-of-norm calculated

Slight variations noticed in this year's estimates when compared to last year's estimates are a function of the fine tuning of the methodology. The scores for previous years will change slightly as the most recent data are incorporated, refining previous estimates.

[I HAVE TYPED THIS ALMOST VERBATIM FROM THE 1994 TVAAS REPORT--I hope that I have typed all the numbers correctly. Apparently, there is a computerized copy of the data available, but the TVAAS staff have fixed it so that no data can be manipulated or copied electronically. I would love to hear their explanation/justification for this.]

___________________________________________________________________

Dr. Sanders' position seems to be that all variations are the result of educational practice (unless someone else proves otherwise). The huge variations from year to year would seem to undercut his position. If the same teachers are teaching (presumably) in much the same way each year, what accounts for the variation in gains? Could the measurement instrument have anything to do with it? Or could the model just be misspecified?

2. Dr. Sanders (through Sandra Horn) claims that"

"When we have had direct knowledge of change in educational practice, then we have observed change in the effects. For instance, Knox county, which has a middle school system, has had severe retardation in the gains for 6th grade (the first year of middle school). This past year a major effort was launched to improve communication between feeder schools and receiving schools such that instruction could be provided earlier in the school year commensurate with where the feeder schools had left off the previous year. After this effort, the Knox County 6th grade gains improved appreciably."

The following are gain scores for 6th graders in Knox County:

 
                Math    Reading         Language        SocS    Science
USA Norm        19.0    18.0            15.0            10.0    13.0
1992:           12.5    15.1            14.1             1.1     5.3
1993:            9.7     7.8             5.2            -7.0    12.1
1994:           16.6    13.9             4.3            13.2     6.3
 
3 Yr Avg        12.9    12.3             7.9             2.5     7.9
Std Error        0.3     0.3             0.3             0.4     0.4

Sixth graders experienced drops in gains in 1993, and recoveries in some gains in 1994--but improvement in gains scores is not even across all subjects tested. Dr. Sanders provides nothing more than anecdotal evidence to prove that any increase in gain scores is due to educational effort--and he fails to mention the drops in language arts and science. Comparisons between gain scores in 1992 and 1994 show virtually no overall improvement. Gain scores in 1993 look more like the exception than the rule. THe change in gain scores could be, among many other potential factors, merely natural year-to-year variations.

3. I keep reading that TVAAS differentiates between district, school, and teacher effects on student learning. Does the student gain at the classroom level equal the teacher's effect? Does the student gain at the school level equal the school's effect? Does the student gain at the system level equal the system's effect?

4. Dr. Sanders admits that we don't even have a measure of student ability. How can we know that a student's innate ability has no effect on gain? (And if ability--or lack thereof--doesn't matter, why are special education students excluded from teacher assessment?) Dr. Sanders believes that low-achieving students (who may or may not be low ability students) can make "satisfactory gains" consistent with the national norm gain---but a low-ability student may not be able to do so, at least at the same rate as a higher-ability student (and the implicit assumption in the TVAAS model seems to be that all students should learn at the same rate--the national norm gain). Thus teachers might have to work much harder with a group of low-ability students in order to achieve acceptable gain scores. This discriminates against teachers with low-ability (as opposed to low-achieving) students.

5. Will the TVAAS staff please provide information about the correlation between the results of the norm-referenced test and the criterion-referenced test? It would help to know how well the norm-referenced test matches the Tennessee curriculum. We could have some objective evidence that the norm-referenced test sufficiently matches the curriculum if it can be shown that students score roughly the same on the two tests---at all grade levels and in all subjects.

Thanks, Bob Saffold
Vanderbilt Institute for Public Policy Studies

From: Sandra P Horn
Subject: To Saffold Re: More TVAAS Questions

I have only this minute read the post from which this exerpt is taken, so I can only reply to one item at this time and it is the one below:

On Wed, 25 Jan 1995, William Robert Saffold wrote:

> 2.  Dr. Sanders (through Sandra Horn) claims that"
>
> "When we have had direct knowledge of change in educational practice, then
> we have observed change in the effects.  For instance, Knox county, which
> has a middle school system, has had severe retardation in the gains for 6th
> grade (the first year of middle school).  This past year a major effort was
> launched to improve communication between feeder schools and receiving
> schools such that instruction could be provided earlier in the school year
> commensurate with where the feeder schools had left off the previous year.
> After this effort, the Knox County 6th grade gains improved appreciably."
>
> The following are gain scores for 6th graders in Knox County:
>
>                 Math    Reading         Language        SocS    Science
> USA Norm        19.0    18.0            15.0            10.0    13.0
> 1992:           12.5    15.1            14.1             1.1     5.3
> 1993:            9.7     7.8             5.2            -7.0    12.1
> 1994:           16.6    13.9             4.3            13.2     6.3
>
> 3 Yr Avg        12.9    12.3             7.9             2.5     7.9
> Std Error        0.3     0.3             0.3             0.4     0.4
>
> Sixth graders experienced drops in gains in 1993, and recoveries in some
> gains in 1994--but improvement in gains scores is not even across all
> subjects tested.

You are absolutely correct, and the fault is mine that this example was included in Bill Sanders' post.

When Bill wrote out his response to questions regarding validity, he forwarded it to me for a read-through. He also sent another post that told me to remove the item regarding Knox County schools because upon checking the figures, he had found that the findings were not consistant across all subjects, although improvements had occurred in some. However, there were two items regarding Knox County, and I removed the other one. I then sent the post to Edpolyan without returning it to Bill for checking. I just assumed I'd deleted the right thing. Upon reading this post, I discovered my mistake.

As for why these "anecdotal" examples were provided, it was in answer to those who find statistical validity "invalid" in terms of educational assessment.

Saffold raises several questions that will required detailed responses. We will check through his figures to be sure they are correct, although I assume they are. However, I am sometimes mistaken in my assumptions (see above), so we will check them for typos. We will attempt to answer each point in detail.

I apologize to all of you and to Bill Sanders for my mistake.

Sandra Horn

From: Greg Camilli
Subject: Re: Validity, again (no apology)

Rick Garlikov and Michael Scriven have suggested on more than one occasion that any test scores are fallible; surely that is no reason to abandon them in the assessment process. The problems with the TVAAS program may be

The scale of the scores may have anomalies (more variance in grade 1 than in grade 12),
The model may not take into account student background (e.g., family income, neighborhood, preschool, parents' education),
The model and its technical properties (e.g., standard errors) may be understood by only a very few,
*Some* test scores may be corrupted by the unscrupulous, other less wittingly,
The testing program may encourage some to teach the test rather than the educational skills,
The population actually tested may not be well understood (e.g., the effects of absences and exemptions.

When I think about it, the differences between this program and many other assessment programs is not that great, especially because the high stakes provisions seem to be loosely coupled with legal consequences. I don't remember reading in the material that Serman provided whether there are graduation consequences for seniors (like there are in other assessment programs.) However, I have reacted to some of the more extravagant claims made for the model.

Suppose it were the case that many of these claims were untrue (e.g., disentangling teacher and student effects). And suppose we agreed that test scores reflected the actual level of attainment for a group of children that we would describe at length. Thus, a child is tracked, and an unadjusted growth curve is plotted, or a teacher is tracked. We don't pretend the information is unconfounded with uncontrolable sources of variation.

Given this scenario, I have a couple of policy questions:

Is this information useful? If so, how?
Is it what the "public" or state adminstrators want? (As opposed to what you, the reader, want.)
Are the *model's* claims essential for its role as a high stakes test?

I ask these questions in good faith; they arose from consideration of TVAAS, but extend far beyond. Can *large-scale* testing programs can be used to evaluate teachers and students?

From: William Robert Saffold
Subject: TVAAS & Legitimate Inquiry

Kathy Bolland's response to my post underscores the reason I have refrained from entering the discussion until now. Debates of this nature always seem to end up forcing people into one of two positions: either they deny that there are any valid and/or cost-effective means of evaluating teachers, or they accuse those who raise questions about assessment of trying to let poor teachers off the hook. I categorically reject this false dichotomy. Raising legitimate questions about a new and poorly-understood model of assessment hardly strikes me as an attempt to "kill the messenger." Failing to raise basic questions about model construction and interpretation of results strikes me as an abdication of our responsibility as educators and citizens.

My point about the TVAAS and student abilities was this: There are differences in student abilities and those differences have an impact on value-added scores. If the TVAAS does not take student abilities into account, then it is more favorably disposed to some teachers than to others. If students were randomly assigned to teachers, this point might be moot--but we know that this is not the case. If you were a classroom teacher, would you want to take the risk that the TVAAS was unfairly weighted against you? Or would you raise the questions in hopes that the issue might be remedied? I choose the latter course.

Both Kathy and Rick seem to believe that I reject the value-added scores themselves as indicators of student gain. This is not so. What concerns me is that gains (or losses) at the classroom level might be ascribed solely to teachers--without any acknowledgment that there are other factors at work here. Of *course* it is important to know that low-ability kids (or high-ability kids for that matter) are not reaching their potential--but is it reasonable to assume that *all* children, regardless of their natural abilities, will achieve the same amount of gain (i.e., the national norm gain) in each subject each year? And is it reasonable to tell all teachers, regardless of the types of children they teach, that each and every class of children must meet or exceed that national norm gain or they have failed as teachers? The answer to both questions may be "Yes"--but I (and, I suspect, many others) would like to see some evidence to support this position.

I do not see these questions as an attempt to shift the blame for poor student academic gains away from teachers and on to the model itself. Quite the contrary--I think the model may very well be incredibly useful. As I told Sandra Horn in an off-list post, I am sincerely interested in the TVAAS. My questions are meant to enlighten both myself and others as to the way the model really works. I hope they will be regarded in that light.

From: Sherman Dorn
Subject: TVAAS

I've been sitting on this for a few days, in part because of scheduling and a cold, but also because I've been wanting to think about a few posts. Rick Garlikov writes:

> (3) if administrators and the media and public cannot be made to
>understand how to use these numbers as an indicator for this one narrow
>aspect of schooling, I don't see how the kind of thing Gene and Sherman
>describes is possible for them to apprehend and use at all.  It seems to
>me that what TVAAS is advocating is even easier for, say, administrators
>to understand than what Gene or Sherman is advocating.

This bewilders me. In part, this is because I think what I wrote and what Gene Glass wrote were very simple and easy to understand. Maybe that's my bias. But in addition, from a political standpoint, simplified statistical explanations may very well encourage oversimplified policies. People who don't understand TVAAS numbers except as "well, they mean something, and here's the ranking of your school" WILL take the numbers as absolute, which is what NO ONE ON THIS LIST has advocated. When TVAAS numbers were provided the print media this year, Nashville papers printed the rankings of schools without any indication of which differences in rankings meant something, even to the TVAAS statisticians.

In response to Gene Glass, William Sanders writes:

>     We do ex post facto analyses relating the teacher or school
>effects to variables that have been accepted by some (at least) to
>proxy socio economic status.  We have found the following based
>upon the state-wide analysis.  There is no relationship between the
>school effects and: (1) the percentage of students receiving free
>and reduced lunches in the school; (2) the racial composition of
>the student body; (3) the location of the building as to urban,
>suburban and rural.

This is provocative, but it is ecological reasoning. You cannot extrapolate from an analysis of system characteristics to student characteristics. To demonstrate the irrelevance of a variable to TVAAS, UTVARAC should see if adding individual characteristics such as race, sex, family income, or the presence of a disability would change the rankings of effects on various levels (teacher, school, or system). Or, if one merely wants to look at previous student scores, UTVARAC should be able to calculate bivariate plots/regression equations for each grade, pairing up the scale scores from the previous year to the current year (for example, fourth-grade math computation test scale scores to those students' math computation test scale scores for the prior year, when they were in third grade). What do the plots look like? What are the unstandardized regression coefficients? If Sanders is right, the plots should look very linear and (as importantly) the regression slope should be just about 1 (i.e., kids with high prior scores should be gaining the same as kids with low prior scores).

Later in the same post, Sanders writes:

>     The statements that we have made are not mere assumptions but
>rather come from the results of the data analysis!! However, if
>needed, the modeling process which we deploy does not in any way
>preclude the use of other covariables. Let me restate that, to date,
>we have not found the need for these additional variables.  If, in
>the future, using other test data, a need is found to insure
>fairness, then the models will be expanded to include them.

This assurance seems rather unusual for someone producing publicly-consumed statistics. In cases of controversy or doubt, the most reputable statistical bureaus in the country produce alternate sets of statistics with explanations of their assumptions. For example, the Census Bureau provides alternative population projections (labeled simply high, medium, and low) for the near-term future, and explain what assumptions led to each series. I believe that the Congressional Budget Office also provides alternative budget projections, and the Labor Department provides scores of alternative statistics one can choose from. Even the Department of Education, in its annual report on dropout/graduation statistics, is beginning to include alternatives to their three dropout "rates."

I see no reason why TVAAS could not produce alternative analyses with different variables added. It seems like the responsible thing to do, since this is one of the most controversial issues related to TVAAS.

Later yet, in response to Glass' question about general concerns with validity, Sanders made several claims about TVAAS:

>2.  As was mentioned in the most recent post, we now have merged
>the writing assessment data into the master data base.  Even though
>the studies are not complete, it appears that we could substitute
>this data for the language arts data from TCAP without appreciably
>changing the rankings of schools within a system.
 
>3. We have data from the 10th grade TCAP and from the 10th grade
>PLAN tests.  The data could be interchanged with virtually no
>change in the rankings of systems and schools.

These two bits of information suggest that the relative rankings would not depend on the tests mentioned. It still assumes that these high-stakes tests are the appropriate venues for judging student performance.

>4. When we have had direct knowledge of change in educational
>practice, then we have observed change in the effects.

This is a rather broad generalization. In order to confirm the thesis of improvement in effects if and only if changes in practice, you would have to have definitions of "change in educational practice" and then ask the question of every school in the state. And even this statement is not precisely true. For when William Saffold questioned the evidence of this in Knox County, Sandra Horn acknowledged the discrepancy:

 
>When Bill wrote out his response to questions regarding validity, he
>forwarded it to me for a read-through.  He also sent another post that
>told me to remove the item regarding Knox County schools because
>upon checking the figures, he had found that the findings were not
>consistant across all subjects, although improvements had occurred in some.

Please correct me if I am wrong about this: Bill Sanders writes that they have observed changes in effects whenever they "have had direct knowledge of change in educational practices." Knox County is one of those systems where UTVARAC has knowledge of such changes. Knox County does not show such uniform improvement. Sanders asked Horn to delete the Knox County data from the response but, and again correct me if I am misreading this, he did not revise the broader claim to reflect his new reading of the data.

Considering that he acknowledges that Knox County does not fit his prior statement, I would like to know what his current claim about this form of validity is.

From: Leslie McLean

Reply to William Sanders regarding hierarchical linear models (and multilevel linear and non-linear models) and his kind invitation to visit Knoxville.
Alas, Dr. Sanders, it is mid-crunch-term and I am up to my fingertips in students, on- and off-line. Your invitation is much appreciated, and I hope some of your fellow correspondents on EDPOLYAN will be able to take you up on your generous offer. As soon as classes end here (mid-April--eat your heart out American colleagues), I'm off for an isolated cottage on Prince Edward Island (off Canada's East Coast) to finish a book about teachers and teaching with my dear friend and colleague, Prof. Johan Aitken. I hope you will keep up the dialogue with Harvey Goldstein and others who can appreciate the work you are doing. I have read the paper in J. Pers. Eval. in Educ. and agree with Gene that it does not answer (as it could not) all the important questions. I have not seen the Amer. Statistician article. My position is still this:
When it comes to the validity of numbers reported to schools, teachers and students (not to speak of local newspapers, who often get the numbers whether you send them copies or not):

The details are everything: items, scaling, regression models, hierarchical models, treatment of missing data, rules for inclusion/exclusion, reporting formats and details (with caveats, whether ignored or not), follow-up meetings in districts/schools to discuss interpretations and releases to the press.
The content of the tests is important, but not as important as the way it is tested: the pedagogy suggested by the form(s) of high-stakes tests. NOWHERE is this more important than in the testing of language, and nowhere are the tests weaker than in this area. Most standardized (read 'published') tests CANNOT (and hence do not) test competence/proficiency in first language according to the best current pedagogy. In short, the "Reading" scores predict other reading scores but not whether students can read their textbooks--and especially not whether they can and do read books/newspapers/magazines. The scores and their non-linear scaled transformations tell us something about the level of literacy of students, but they do not, and cannot, tell teachers how to help the students who attain low scores. Since I make this claim, I make the corollary claim that gain scores are equally unhelpful to teachers. I offer some simple rules in this enormously complex enterprise:
1. ANYTHING that demeans and diminishes respect for teachers is BAD.
2. Test scores published in newspapers demean and diminish respect for teachers, though this is almost never the purpose of the publication.
3. The validity of a test score diminishes by the cube of the distance of the test constructor from the classroom.
4. The credibility of a derived achievement test score increases by the square of the number of terms in its predictive equation, multiplied by 10 times the number of times "exp" appears in the formula (whether direct or implied by 'log').
5. The correlation between credibility and validity is -0.14161828. Corollary: The credibility of scores arrived at by item-response models is approaching infinity so this may diminish the accuracy of the above correlation estimate.
6. The demand for published test scores is insatiable, esp. IRT scores.
7. Discussion on public forums such as EDPOLYAN is pointless.
----------------------------------------------------------------
From: Harvey Goldstein
Returning after a few days absence I find another 100 or so mesages about TVAA which I will slowly work my way through. However a few people have referred to the 'mixed model' used by TVAA so it might be worth trying some clarification of this (though when I thought I was trying to be helpful about standard errors it seemed to create more confusion than clarification!) The mixed model, elaborated by Henderson, Harville and others and the subject of the American Statistician paper (1991) by (Robert) McLean et al is in essence just the ordinary multiple regression model where the coefficients are allowed to have (random) distributions across the units defined in the model. Thus, e.g., if we have a model with students and schools identified, we can take a coefficient in the model, say a pretest score coefficient and specify that it varies randomly across schools and then estimate its variance. If we have several of these we can also estimate covariances among them. I include here the general linear model extension to ordinary multiple regression which allows factors, such as gender, or, say, type of school to be included (as dummy variables). Then you could see whether the gender coefficient (representing boy-girl difference) varies from school to school. In the TVAA case the mixed models of interest are known as hierarchical models or multilevel models because, as I have described in my example, they model a nested or hierarchical structure where students are grouped within achools and you have between student and between school variation. As I understand it (and the TVAA group and I are corresponding about the technical details) the basic TVAA model is what is known as a repeated measures model where the same students are repeatedly measured and you formally model measurements grouped within students, themselves grouped within schools. This is a 3-level structure. The advantage of this formulation is that you can isolate the influences at each level and study the factors which may explain the variations. As a by-product you can also estimate 'value added' scores, for example for teachers or schools - together with standard errors(!).
In the mid 1980's a number of investigators (Aitkin and Longford, Bryk and Raudenbush, my group in London)worked out efficient computational procedures that allow very complex and large datasets to be analysed efficiently and have led to important new insights. At least 3 software packages exist and the big groups (eg SAS) are also now getting in on the act. Ita Kreft did a review in 1990 and has an update due out soon in the Americal Statistician of these packages. The journal of educational and behavioural statistics is about to produce an issue on this also and there is a large literature about, with applications in education and elsewhere, including a few expository texts. If anyone is interested I can supply some introductory references, and the recent book by Bryk and Raudenbush would be a good start (hierarchical linear models, Sage, 1992). The Mclean et al article is not in my view a good introduction and also doesn't mention any of the important multilevel literature since 1986.
In short, 'mixed models' are neither obscure nor really difficult for anyone with a basic understanding of multiple regression - and there are efficient, publicly available software packages which are being used by large numbers of quantitative social, biological and medical scientists. Multilevel models have come to be recognised as the basic statistical technique in school effectiveness studies and there is a growing literature there too.
From: Sandra P Horn
Subject: W. R. Saffold's Questions
William L. Sanders offers these responses to W. Robert Saffold's questions about TVAAS:
Saffold states: "I HAVE TYPED THIS ALMOST VERBATIM FROM THE 1994 TVAAS REPORT-- I hope that I have typed all the numbers correctly. Apparently, there is a computerized copy of the data available, but the TVAAS staff have fixed it so that no data can be manipulated or copied electronically. I would love to hear their explanation/justification for this."
These files are available in ASCII format for easier transfer to other computers. The ASCII formatted data is available on request from the UTVARAC. Many school systems have already availed themselves of this opportunity.
Saffold: "Dr. Sanders' position seems to be that all variations are the result of educational practice (unless someone else proves otherwise). The huge variations from year to year would seem to undercut his position. If the same teachers are teaching (presumably) in much the same way each year, what accounts for the variation in gains?"
There are 138 school systems in our state. Each system gets a report for each of five subjects. Of the 690 reports, there certainly are many examples of considerable difference from year to year. Yes, I attribute most of the differences (above and beyond the 'noise' in the entire process) to changes in educational practice. Some of the changes may be known by the locals, some may be easily identified, while others may be much more subtle and much harder to identify. Let me share some of the reports which I have received from educators within systems that have been addressing some of the more obvious differences.
Among other things, they attribute the variations to
1. Less than desirable communication across grade levels. Pretend that two years ago that a system/school had retarded gains in math from second to third grade. Assume that the high achieving students in the third grade were permitted to progress at a more rapid pace. But assume that the fourth grade instruction was 'locked-in' and was not extended from where the third-grade faculty 'left-off'. The gain from third to fourth could show some dramatic change between adjacent years.
2. Changes in the structure of the school day. Consider a school in which last year the gains in science were low but the gains in language arts were quite high. If the school decides, as a result of this, to reallocate time so that more time is spent on science at the expense of another subject area, both subjects may show a change from one year to the next.
These are just two of many examples of practices that could result in large changes between adjacent years. However, the school and system main effects are considerably larger than the school*year and system*year interactions.
Saffold asks: "Could the measurement instrument have anything to do with it?"
The test forms are different but equivalent each year. The distributions for each subject and grade over the five years of testing are extremely similar. The means have gone up slightly in some subjects, but the over-all variances are virtually identical. These distributions are based upon the 55-56,000 records obtained from each grade-subject combination each year. The simple r's are about .7 between scores in adjacent grades for each subject.
"Or could the model just be misspecified?"
No model is perfect. However, I can demonstrate that the school and teacher effects are not related to indicators of ses and prior and post levels of achievement.
3. Saffold notes, "I keep reading that TVAAS differentiates between district, school, and teacher effects on student learning. Does the student gain at the classroom level equal the teacher's effect?"
The teacher model is:
```
Y(ijklm) = mu(ijk) + Year*grade*subject*teacher(ijkl) + e(ijklm)
where  i= ith year j= jth grade k=kth subject l=lth teacher
       e(ijklm)=mth student score within i,j,k,l
 
var Y  is ZGZ' + R
 
In this specific model the mu(ijk) is deemed to be the fixed part
of the equation, while year*grade*subject*teacher is the random
part.
 
The general form of the mixed model equations that we use is:
  -                                     -  -   -   -           -
 |   X'*INV(R)*X   X'*INV(R)*Z           | | b |   |X'*INV(R)*Y|
 |                                       | |   | = |           |
 |   Z'*INV(R)*X   Z'*INV(R)*Z  + INV(G) | | u |   |Z'*INV(R)*Y|
 |-                                      -  - -     -          -
 
 The G-matrix contains the covariances among teachers over
grades,subjects and years.
 
The R-matrix contains the covariances among student scores over
all years, subjects and grades.
 
The Y-vector contains the scale scores (not gains) for all
students over all subjects over all years and over all grades for
a school system.
 
The X-matrix contains all fixed effects, either continuous or
discrete variables.
 
The Z-matrix contains all random effects.
 
THE B-VECTOR IS BLUE IF G AND R ARE KNOWN.
THE U-VECTOR IS BLUP IF G AND R ARE KNOWN.
  Harville has called the u's , 'the realized value of the random
variable', a label that I personally like.
 
If G and R are estimated from the data, then B and U are often
referred to as empirical BLUE  and BLUP.  These estimations are
usually completed with REML, or as in our case with close
approximations for G and R for computing reasons.
```
Upon inspection, it is easy to see that the number of equations to be solved for a large system like Memphis will number into the tens of thousands.
The U's sum to zero for each grade and each subject. These are 'shrinkage' estimates and give a direct measure of each teacher's effect as the deviation from the system mean for each subject each year. By choosing estimable functions correctly, these estimates can be scaled as 'gains'. Several points: with little information the estimates are 'pulled' close to the average. This adds considerable protection against someone having an unfair estimated effect due to chance; in fact, an estimate cannot be judged different from average unless the effect is extreme and until considerable information is accumulated. The sensitivity of this process accrues because we are using the considerable correlation structure that accumulates by simultaneously considering all records for each student and fitting all teacher effects simultaneously (a comment that McLean felt that I was fobbing him with).
(Note: We encode Z in various ways to accommodate various forms of teaching, ie. self-contained classroom, departmentalized instruction, team teaching, teaching across grades, changing assignments over time, etc. Also, we have developed a procedure that we call our stacked-block concept, which I won't go into here, that adds considerable improvement to the sensitivity of the estimates).
(Second note: To all of you HLM modelers, I am fully aware of these models. I recognize that they are a sub-set of the general form of mixed model theory and methods. I don't agree with a recent post of McLean that there is a new theory and methodology that surpasses mixed models. They have been developed to deal with a different set of problems when covariables are included at different places on the hierarchy).
Saffolf asks, "Does the student gain at the school level equal the school's effect? Does the student gain at the system level equal the system's effect?"
The school model is like the teacher model except it is included as a fixed effect.
4. Saffold says, "Dr. Sanders admits that we don't even have a measure of student ability."
We don't have a DIRECT measure of student ability. However, the evidence strongly supports that by including all of the covariance structure that an unbiased measure of the teacher effect is obtained. The state-wide analysis clearly shows that the group of students in Tennessee which are not making as much gain in most of our systems is NOT the low achieving students. Rather it is the HIGH achieving students. Some of my colleagues are nearly through with a manuscript that will document this fact.
Saffold asks, "(And if ability--or lack thereof--doesn't matter, why are special education students excluded from teacher assessment?)"
That part of the EIA was suggested by others.
Saffold says, "Dr. Sanders believes that low-achieving students (who may or may not be low ability students) can make "satisfactory gains" consistent with the national norm gain---but a low-ability student may not be able to do so, at least at the same rate as a higher-ability student (and the implicit assumption in the TVAAS model seems to be that all students should learn at the same rate--the national norm gain)."
No. The implicit assumption is that all students eligible to take the TCAP tests can achieve gains at least equal to the norm gain if they are provided effective instruction from where they enter the classroom, academically.
Saffold says,"Thus teachers might have to work much harder with a group of low-ability students in order to achieve acceptable gain scores. This discriminates against teachers with low-ability (as opposed to low-achieving) students."
I find it interesting that you and others have expressed concern about teachers of low achieving (or low ability) students being penalized by the TVAAS process. In fact, most of the concerns that we have received have come from teachers, principals, etc. within systems and schools with a disproportionate number of high achieving students. In fact, many educators who are working with populations of students with lower achievement (and lower abilities) have expressed to me that they are delighted with TVAAS because for the first time there is some documentation for the public to see that their students are making creditable progress.
5. Saffold asks, "Will the TVAAS staff please provide information about the correlation between the results of the norm-referenced test and the criterion-referenced test?"
We have not done a thorough analysis of this. The state- wide system criterion referenced data are available from State Testing.
"It would help to know how well the norm-referenced test matches the Tennessee curriculum."
That question has been recently raised and a thorough review of that issue has been completed. The correlation is extremely high in all subject areas. If you want more information, I would contact someone on the staff of the State Board of Education.
We hope that this answers the points you have raised.
From: Sherman Dorn
Subject: Assessment
In a post responding to me on the comparative reform questions I've raised, Rick Garlikov raised an important question about assessment:
```
>     Finally, Sherman, in previous posts, does not seem to expect
>the same problems with internal assessments that he tends to
>expect from external ones.  I find that odd.  As I said in a
>previous post, if it is acceptable for sixth grade teachers to
>assess what their new students have been taught previously, why
>is it not acceptable for the state or for private assessment
>enterprises to monitor as students go along, in order to report
>to parents?
 
```
I see nothing fundamentally wrong with reporting progress data to parents. I see everything wrong with the way that the state legislature in Tennessee established TVAAS. The devil is in the details in this matter.
Consider, for example, whether TVAAS establishes a high-stakes environment, and whether teachers see it as a relatively non-pressured form of feedback (which is how UTVARAC staff argue it should be used ideally). The legislature created TVAAS' legal framework in a way that made it very clear to teachers what the point was. In 1991 and early 1992, then-governor Ned McWherter was trying to craft a major education financing reform (including a state income tax to fund it) in response to a successful finance equity suit brought by districts around the state. There is quite a bit of evidence that TVAAS was seen as the accountability "bait" to get legislators to go along with the tax hike. The state consulted with Bill Sanders about TVAAS, and the chief sponsors of the legislation quickly accepted an amendment by state senate education chair Ray Albright to include TVAAS in the bill.
The state commissioner of education at the time, Charles Smith, stated publicly that value-added assessment would give Tennessee the best accountability mechanism in the nation, and it would be part of a carrot-and-stick approach to education reform. Lamar Alexander (then Secretary of Education) approved of McWherter's bill precisely because of TVAAS as an accountability mechanism. Several legislators tried to amend the bill to remove everything EXCEPT TVAAS from the reform bill. Others tried to amend the bill to include explicit cut-offs at which administrators would be removed or to make teacher effect estimates public. (Sanders is on record as having opposed the latter type of amendments.)
In the end, McWherter did not get his state income tax, but did get a half-cent hike in the state sales tax and most of what he proposed plus TVAAS.
All of this was reported in the Tennessee Education Association newsletter at the time. I don't think this could have escaped notice among most public school teachers that many legislators may have voted for the increased funding for schools only because there was TVAAS attached, or what the proposed amendments represented.

Sunday, April 26, 2020

An Archaeological Dig for VAM