I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Sunday, May 21, 2017

Random Thoughts On Using VAM for Teacher Evaluation

You may have read the piece in the New York Times today by Kevin Carey on the passing of William Sanders, the father of the idea of using value-added modeling (VAM) to evaluate teachers. Let me first offer my condolences to his family.

I'm going to skip a point-by-point critique of Carey's piece and, instead, offer a few random thoughts about the many problems with using VAMs in the classroom:

1) VAM models are highly complex and well beyond the understanding of almost all stakeholders, including teachers. Here's a typical VAM model:


Anyone who states with absolute certainty that VAM is a valid and reliable method of teacher evaluation, yet cannot tell you exactly what is happening in this model, is full of it.
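
For a flavor of what sits inside the box, here is an illustrative sketch of the kind of layered, mixed-effects specification described in the VAM literature. To be clear, this is a generic form I'm writing out for illustration, not any particular state's operational model:

$$ y_{it} = \mu_t + \mathbf{x}_{it}'\boldsymbol{\beta} + \sum_{s \le t} \theta_{t-s}\,\tau_{j(i,s)} + \epsilon_{it} $$

Here $y_{it}$ is student $i$'s score in year $t$, $\mathbf{x}_{it}$ is a vector of student covariates, $\tau_{j(i,s)}$ is the (random) effect of the teacher the student had in year $s$, $\theta_{t-s}$ governs how much of a prior teacher's effect persists, and $\epsilon_{it}$ is everything the model can't explain. Estimating the teacher effects requires shrinkage, assumptions about persistence, and judgment calls about which covariates belong in the model -- none of which is obvious to a non-specialist.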

There was a bit of a debate last year about whether it matters that student growth percentiles (SGPs) -- which are not the same as VAMs, but are close cousins -- are mathematically and conceptually complex. SGP proponents argue that understanding teacher evaluation models is like understanding pi: while the calculation may be complex, the underlying concept is simple. It is, therefore, fine to use SGPs/VAMs to evaluate teachers, even if teachers don't understand how their scores were derived.

This argument strikes me as far too facile. Pi is a constant: it represents something (the circumference of a circle divided by its diameter) that is concrete and easy to understand. It isn’t expressed as a conditional distribution; it just is. It isn’t subject to variation depending on the method used to calculate it; it is always the same. An SGP or a VAM is, in contrast, an estimate, subject to error and varying degrees of bias depending on how it is calculated.

The plain fact is that most teachers, principals, school leaders, parents, and policy makers do not have the technical expertise to properly evaluate a probabilistic model like a VAM. And it is unethical, in my opinion, to impose a system of evaluation without properly training stakeholders in its construction and use.

2) VAM models are based on error-prone test scores, which introduces problems of reliability and validity. Standardized tests are subject to what the measurement community often calls "construct-irrelevant variance" -- which is just a fancy way of saying test scores vary for reasons other than knowing stuff. Plus there's the random error found in all test results, due to all kinds of things like testing conditions. 

This variance and noise cause all sorts of issues when put into a VAM. We know, for example, that the non-random sorting of students into teachers' classrooms can create bias in the model. There is also a very complex issue, known as attenuation bias, that arises when error-prone test scores are used as controls. There are ways to ameliorate it -- but there are tradeoffs.
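
To see why the error matters, here's a minimal simulation sketch -- hypothetical numbers, purely for illustration -- of attenuation bias: when the prior-year score used as a control is measured with error, its estimated coefficient shrinks toward zero, and that distortion leaks into whatever we attribute to teachers.

```python
# Minimal sketch of attenuation bias (hypothetical numbers): regressing current
# scores on an error-prone measure of prior achievement shrinks the coefficient.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_prior = rng.normal(0, 1, n)                       # students' true prior achievement
current = 0.8 * true_prior + rng.normal(0, 0.6, n)     # true growth process

measured_prior = true_prior + rng.normal(0, 0.5, n)    # the test score we actually observe

slope_true = np.polyfit(true_prior, current, 1)[0]
slope_noisy = np.polyfit(measured_prior, current, 1)[0]

print(f"slope on true prior achievement: {slope_true:.2f}")    # ~0.80
print(f"slope on the error-prone score:  {slope_noisy:.2f}")   # ~0.64, biased toward zero
```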

My point here is simply that these are very complicated issues and, again, well beyond the comprehension of most stakeholders. Which dictates caution in the use of VAM -- a caution that has been sorely lacking in actual policy.

3) VAM models are only as good as the data they use -- and the data's not so great. VAM models have to assign students to teachers. As an actual practitioner, I can tell you that's not as easy as it sounds. Who should be assigned a VAM score for language arts when a child is Limited English Proficient (LEP): the ELL teacher, or the classroom teacher? What about special education students who spend part of the school day "pulled out" of the homeroom? Teachers who team teach? Teachers who co-teach?

All this assumes we have data systems good enough to track kids, especially as they move from school to school and district to district. And if the models include covariates for student characteristics, we need to have good measures of students' socio-economic status, LEP status, or special education classification. Most of these measures, however, are quite crude.*

If we're going to make high-stakes decisions based on VAMs, we'd better be sure we have good data to do so. There's plenty of reason to believe the data we have isn't up to the job.

4) VAM models are relative; all students may be learning, but some teachers must be classified as "bad." Carey notes that VAMs produce "normal distributions" -- essentially, bell curves, where someone must be at the top, and someone must be at the bottom.


I've labeled this with student test scores, but you'd get the same thing with teacher VAM scores. Carey's piece might be read to imply that it was a revelation to Sanders that the scores came out this way. But VAMs yield normal distributions by design -- which means someone must be "bad."
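
A toy example (made-up numbers) shows why this matters: even if every teacher's students gain ground, a relative measure still forces a fifth of teachers into the bottom quintile.

```python
# Toy illustration with made-up numbers: every teacher's students gain,
# yet a relative (normed) measure still labels 20% of teachers "bottom quintile."
import numpy as np

rng = np.random.default_rng(1)
gains = rng.normal(10, 2, 1000).clip(0.5, None)   # every teacher shows a positive average gain

percentile_rank = gains.argsort().argsort() / (len(gains) - 1) * 100
print((percentile_rank < 20).mean())   # ~0.20: a fifth land at the bottom by construction
print(gains.min())                     # > 0: yet nobody's students lost ground
```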

General Electric's former CEO Jack Welch famously championed ranking his employees -- which is basically what a VAM does -- and then firing the bottom x percent. GE eventually moved away from the idea. I'm hardly a student of American business practices, but it always struck me that Welch's idea was hampered by a logical flaw: someone has to be ranked last, but that doesn't always mean he's "bad" at his job, or that his company is less efficient than it would be if he were fired.

I am certainly the last person to say our schools can't improve, nor would I ever say that we have the best possible teaching corps we could have. And I certainly believe there are teachers who should be counseled to improve; if they don't, they should be made to leave the profession. There are undoubtedly teachers who should be fired immediately.

But the use of VAM may be driving good candidates away from the profession, even as it is very likely misidentifying "bad" teachers. Again, the use of VAM to evaluate systemic changes in schooling is, in my view, valid. But the argument for using VAM to make high-stakes individual decisions is quite weak. Which leads me to...

5) VAM models may be helpful for evaluating policy in the aggregate, but they are extremely problematic when used in policies that force high-stakes decisions. When the use of test-based teacher evaluation first came to New Jersey, Bruce Baker pointed out that its finer scale, compared to teacher observation scores, would lead to making SGPs/VAMs some of the evaluation but all of the decision.
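
Baker's point is easy to see with a toy composite -- the weights and scales below are assumptions for illustration, not the actual AchieveNJ formula: when observation ratings cluster on a coarse rubric while SGPs spread across 1-99, the fine-grained component ends up deciding where teachers land, whatever its nominal weight.

```python
# Toy composite under assumed numbers: clustered observation ratings (1-4 rubric)
# weighted 70%, fine-grained SGPs (1-99) weighted 30%. Which one drives the ranking?
import numpy as np

rng = np.random.default_rng(2)
n = 5000
observation = np.round(rng.normal(3.5, 0.15, n).clip(1, 4), 1)   # ratings cluster near the top
sgp = rng.integers(1, 100, n)                                    # growth percentiles spread 1-99
sgp_rescaled = 1 + 3 * (sgp - 1) / 98                            # mapped onto the same 1-4 scale

composite = 0.7 * observation + 0.3 * sgp_rescaled

print(np.corrcoef(composite, observation)[0, 1])   # weak link to the 70% component
print(np.corrcoef(composite, sgp)[0, 1])           # strong link: the SGP decides the ranking
```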

But then NJDOE leadership -- because, to be frank, they had no idea what they were doing -- gave teacher observation scores a phony precision. That led to high-stakes decisions compelled by the state based on arbitrary cut points and arbitrary weighting of the test-based component. The whole system is now an invalidated dumpster fire.

I am extremely reluctant to endorse any use of VAMs in teacher evaluation, because I think the corrupting pressures will be bad for students; in particular (and as a music teacher), I worry about narrowing the curriculum even further, although there are many other reasons for concern. Nonetheless, I am willing to concede there is a good-faith argument to be made for training school leaders in how to use VAMs to inform, rather than compel, their personnel decisions.

But that's not what's going on in the real world. These measures are being used to force high-stakes decisions, even though the scores are very noisy and prone to bias. I think that's ultimately very bad for the profession, which means it will be very bad for students.

Carey mentions the American Statistical Association's statement on using VAMs for educational assessment. Here, for me, is the money quote:
Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability to teachers. This is not saying that teachers have little effect on students, but that variation among teachers accounts for a small part of the variation in scores. The majority of the variation in test scores is attributable to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences. 
The VAM scores themselves have large standard errors, even when calculated using several years of data. These large standard errors make rankings unstable, even under the best scenarios for modeling. Combining VAMs across multiple years decreases the standard error of VAM scores. Multiple years of data, however, do not help problems caused when a model systematically undervalues teachers who work in specific contexts or with specific types of students, since that systematic undervaluation would be present in every year of data. 
A VAM score may provide teachers and administrators with information on their students’ performance and identify areas where improvement is needed, but it does not provide information on how to improve the teaching. The models, however, may be used to evaluate effects of policies or teacher training programs by comparing the average VAM scores of teachers from different programs. In these uses, the VAM scores partially adjust for the differing backgrounds of the students, and averaging the results over different teachers improves the stability of the estimates [emphasis mine]
Wise words. 
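
To put a rough number on the instability point -- again, a hypothetical illustration, not an analysis of any state's actual data -- suppose the estimation error on a teacher's VAM score is about as large as the spread of true teacher effects (an assumption, but not an outlandish one given the ASA's description of large standard errors). Two independent years of estimates will frequently place the same teacher in different quintiles:

```python
# Hypothetical illustration of noisy rankings: simulate true teacher effects,
# add estimation error of comparable size, and see how often two independent
# years of estimates put a teacher in the same quintile.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
true_effect = rng.normal(0, 1, n)
year1 = true_effect + rng.normal(0, 1, n)   # estimate with a large standard error
year2 = true_effect + rng.normal(0, 1, n)   # a second, independent estimate

q1 = np.floor(year1.argsort().argsort() / n * 5)   # quintile by year-1 estimate
q2 = np.floor(year2.argsort().argsort() / n * 5)   # quintile by year-2 estimate

print((q1 == q2).mean())   # well under half of teachers stay in the same quintile
```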

NJ's teacher evaluation system, aka "Operation Hindenburg."



* In districts where there is universal free-lunch enrollment, parents have no incentive to fill out paperwork designating their children as FL-eligible. So even that crude measure of student economic disadvantage is useless.

Saturday, May 20, 2017

U-Ark Screws Up A Charter School Revenue Study, AGAIN: Part II

Here's Part I of this series.


If this is true, it's really disturbing:
Colorado’s General Assembly on Wednesday passed a bill giving charter schools the same access to a local tax funding stream as district schools have, The Denver Post reported.
The bipartisan compromise measure, which supporters say is the first of its kind in the nation, would address an estimated $34 million inequity in local tax increases. It came a day after the University of Arkansas released a study that found charter schools receive $5,721 less per pupil on average than their district counterparts — a 29 percent funding gap. [emphasis mine]
It is, of course, standard operating procedure for outfits like the U-Ark Department of Education Reform to claim their work led to particular changes in policy; that's how they justify themselves to their reformy funders.  Maybe the connection between the report and the Colorado legislation (which is really awful -- more in a bit) is overblown...

But if the U-Ark report did sway the debate, that's a big problem. Because the report is just flat out wrong. 

As I explained in Part I, the claim that Camden, NJ, has a huge revenue gap between charters and public district schools seems to be based on an utterly phony comparison: all of the district's revenue, including the dollars passed through to the charters, is attributed to the CCPS students alone -- not to the charter students. Because the data source documentation in the report is so bad, I can't exactly replicate U-Ark's figures, so I invited Patrick Wolf and his colleagues to contact me and explain exactly how they got the figures they did.

So far, they remain silent.

But that isn't surprising. When U-Ark put out its first report in 2014, Bruce Baker tore it to shreds in a brief published by the National Education Policy Center. The latest U-Ark report cites Baker's brief, so they must have read it -- but they never bothered to answer Baker's main claim, which is that their comparisons are wholly invalid.

Further, what I documented in the last post is only one of the huge, glaring flaws in the report. Let me point out another, using Camden, NJ again as an example. We'll start by looking at U-Ark's justification for using the methods they do:
This is a study of the revenues actually received by public charter schools and TPS. Revenues equal funding. Revenues signal the amount of resources that are being mobilized in support of students in the two different types of public schools. Some critics of these types of analyses argue that our revenue study should, instead, focus on school expenditures and excuse TPS from certain expenditure categories, such as transportation, because TPS are mandated to provide it but many charter schools choose not to spend scarce educational resources on that item. [emphasis mine]
"Choose" not to spend the revenues? Sorry to be blunt, but that statement is either deliberately deceptive or completely clueless.* In New Jersey, hosting public school districts are required to provide transportation for charter school students. The charters don't "choose" not to spend on transporting the kids; they avoid the expense because the district picks up the cost.

Baker pointed this out explicitly in his 2014 brief -- but U-Ark, once again, refuses to acknowledge the problem, even though we know for a fact they read Baker's report, because they cite it repeatedly.

And it gets worse.


For the sake of illustration, here's a simplified conceptual map of what Camden's public district school bus system might look like. We've got neighborhood schools divided into zones, and buses transporting children to their neighborhood school.** There are exceptions, of course, primarily for magnet and special needs students, but the system on the whole is fairly simple.

Now let's add some "choice":



There has been a marked decline in "active transportation" -- walking or biking -- to school over the past few decades, and school "choice" is almost certainly a major contributor. As we de-couple schools from neighborhoods (which may well have many other pernicious effects), transportation networks become more complex and more expensive.


As I said: New Jersey law requires public school districts like Camden to pay for transportation of charter school students. Which means all of these extra costs are borne solely by the district.


And how much does this cost Camden's charter sector? Nothing -- the district bears it all.

So any comparison of revenues that doesn't exclude transportation -- and, again, it appears that U-Ark didn't exclude it, although their documentation is so bad we can't be sure -- is without merit. Claiming that charter schools have a revenue gap when they use services paid for by public district schools makes no sense.

Folks, this issue is so simple that it doesn't require an advanced understanding of school finance or New Jersey law to understand it. Which makes it all the more incredible the U-Ark team didn't account for it in their findings. And again: if the Colorado Assembly made their decision to raise the funding for charters -- at the expense of public district schools -- on the basis of a report that is this flawed...

Let's take a look at some better -- not perfect, but better -- financial comparisons between Camden's charters and CCPS next.



 * Granted, it might be both...

** It's worth noting that in a dense city like Camden, many of the students will be within walking distance of their neighborhood school. But when you introduce "choice," you make the school system much less walkable, because students are likely traveling greater distances. I was at a conference at Rutgers yesterday where researchers were looking into this issue -- more to come...

Tuesday, May 16, 2017

U-Ark Screws Up A Charter School Revenue Study, AGAIN: Part I

As someone who spends a good bit of his time debunking many of the claims of the education reformsters, one continuing frustration is how many of them don't seem to learn their lessons. Certainly, we can have good faith debates about education policy, and reasonable people can disagree on many things...

But when you've been called out in public for making a big mistake, and you don't at least attempt to correct yourself... well, it's hard to take you seriously -- even if other, less discriminating minds do.

We parents all have heard the claim that something wasn’t fair. “Suzie got a bigger piece of cake than I did!” “Tommy got to go fishing while I had to clean the garage!” “Malachi had a lot more money spent on his education because you sent him to a traditional public school and me to a public charter school!” Well, maybe we haven’t actually heard that last one very often but it would be a more legitimate gripe than the other ones. 
Students in public charter schools receive $5,721 or 29% less in average per-pupil revenue than students in traditional public schools (TPS) in 14 major metropolitan areas across the U. S in Fiscal Year 2014. That is the main conclusion of a study that my research team released yesterday.
This is from the crew at the University of Arkansas's "Department of Education Reform" -- yes, there is such a thing, I swear -- led by the author here, Patrick Wolf. The study Wolf's team produced purports to show that charters are getting screwed out of the revenues they deserve, which are instead flowing to public district schools.

(Side note: if charters "do more with less," why do they need the same money as public district schools? Isn't that part of their "awesomeness"?)

But here's the thing: the methods this study uses are similar to a study they produced back in 2014 -- a study that was thoroughly debunked a month later. In a brief published by the National Education Policy Center, Dr. Bruce Baker* notes that even if we put aside many problems the U-Ark study has with documenting its data sources and explaining its methodologies, one enormous flaw renders the entire report useless:

As mentioned earlier, the major issue that critically undercuts all findings and conclusions of the study, and any subsequent “return on investment” comparisons, is the report’s misunderstanding of intergovernmental fiscal relationships. Again, as the authors note, they studied “all revenues” (not expenditures), because studying expenditures, while “fascinating” would be “extremely difficult” (Technical appendix, p. 385). 
Any “revenue per pupil” figure includes two parts that may significantly affect the figure. What goes into the total revenue measure? And how are pupils counted? If one’s goal is to compare “revenues per pupil” of one entity to another, one must be able to appropriately align the correct revenue measure with the correct pupil measure for each entity. That is, for the district, one must identify the revenues intended to provide services to the district’s pupils and revenues intended to provide services to the charter school’s pupils. If numbers are missed or, worse yet, wrongly attributed, the comparison becomes invalid and misleading. [emphasis mine]
Baker cites several examples of how U-Ark gets this basic idea wrong time and again -- including, in his first example, U-Ark's analysis of Newark, NJ:
One can get closer to the $28,000 figure by dividing total revenue for that year by the district enrollment, excluding sent pupils (charter school, out of district special education, etc.). But this would be particularly wrong and the result substantially inflated because the numerator would include all revenues for both district and sent charter students, but the denominator would include only district students
Again, Baker pointed this out in 2014. But guess what? By all appearances, U-Ark made the same mistake once again in 2017. Let me see if I can explain this with a few pictures.


Unlike the U-Ark report, I'm going to tell you exactly where I'm getting my data for all these slides. The school year is 2013-14, just like the U-Ark report. The fiscal data comes from the User-Friendly Budget Guide** published by the New Jersey Department of Education; U-Ark says its data comes from the NJDOE, so the figures should be the same. I get my charter enrollment numbers from the NJDOE's enrollment data, using the 2013-14 files.

There were, according to these sources, 17,273 students in Camden's total enrollment for 2013-14. These include contracted pre-school and out-of-district placements, which we will set aside for now (even though that is a deeply flawed thing to do -- more later). If we take the total full-time enrollment -- 15,546 -- and subtract 4,251 charter students, we get 11,295 Camden City Public School students.



Total revenues for the district in that year were $369,770,349. This included $54,902,533 in transfers to the Camden charter schools. Understand that this was not necessarily the only source of revenue for the charters, who might also collect funds directly from the federal government or from private sources. It's also worth pointing out here that all 4,251 Camden charter students may not come from Camden (although it's safe to assume the vast majority are city residents). But, as we'll see, that doesn't matter anyway.



Using these figures, U-Ark steps in to make its per pupil calculations. In the numerator is the revenue  collected by the district or the charter schools; in the denominator are the students enrolled in each sector.

See the problem?


If we use all of the $370 million in the district's per pupil figure, but we only count the students in CCPS and not the charters, we wind up double-counting about $55 million. Because that money is in both the district per pupil figure and in the charter figure.

Even U-Ark admits they should not do this:


That $370 million figure -- a figure, by the way, that is deeply flawed (more in Part II) -- should not be the figure that U-Ark uses to calculate CCPS's per pupil figure. I'm not saying this: U-Ark is.

So did they?

My calculation using these figures comes out to $32,738 -- which is very close to U-Ark's figure of $32,569.
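
Spelled out, the arithmetic looks like this (my replication, using the NJDOE figures cited above):

```python
# Replicating the flawed per-pupil calculation with the NJDOE figures above.
total_revenue = 369_770_349   # all 2013-14 district revenue, charter transfers included
ccps_students = 11_295        # 15,546 full-time enrollment minus 4,251 charter students

flawed_per_pupil = total_revenue / ccps_students
print(round(flawed_per_pupil))   # 32,738 -- within a couple hundred dollars of U-Ark's $32,569
```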


Like I said, there are several reasons the figures don't match exactly: precise charter enrollment figures, including various students in out-of-district placements, minor adjustments to the revenue, etc.

But it's clear that Wolf and his U-Ark team used the wrong revenue figure when making their calculation of Camden's per pupil spending; worse, they made the same mistake they made in 2014, even after they had been publicly corrected!



(Side note: we know they read Bruce Baker's review of their earlier report, because they cite it multiple times.)

Now, as I'll explain in the next post, fixing this problem still makes for a deeply flawed analysis. But let's suppose, just for illustration purposes, they had corrected it. What would the figure be?


Here, we subtract the charter school transfer (find it on page 5 of the User-Friendly Summary). Which, according to U-Ark themselves, is the correct way to approach the calculation. What's the result?
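
By my arithmetic, keeping the same enrollment denominator:

```python
# Same NJDOE figures, but with the charter transfer removed from the district's
# revenue -- the correction U-Ark itself says should be made.
corrected_per_pupil = (369_770_349 - 54_902_533) / 11_295
print(round(corrected_per_pupil))   # about 27,876, versus ~32,738 under U-Ark's approach
```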

Again, this is a deeply flawed comparison. But the gap it produces is much smaller than the one you get using U-Ark's methods.

Let me end this part by addressing Professor Wolf and his team directly:

Gentlemen, I have shown in this post exactly where my data came from. Maybe you have different, equally credible sources. None of us would know, however, because your sole citation for state data is: "New Jersey Department of Education, School Finance." (p. 33) If you'd care to share your sources, your data, and how you arrived at your calculations (in appropriate detail to allow for replication, a common standard in our field), then please do; I'll happily publish them here. You can reach me at the email address on the left side of the blog.

But as it stands right now, there is more than enough evidence, in my opinion, to entirely dismiss your report and its conclusions.

Part II in a bit...



ADDING: Previous atrocities have been documented. 



* As always: Bruce is my advisor in the PhD program at Rutgers GSE.

** I use the 2015-16 guide because it gives the latest "actual" figures for 2013-14 available from NJDOE.

Thursday, May 11, 2017

Attrition in Denver Charter Schools

Earlier this month David Leonhardt of the New York Times wrote yet another column extolling the virtues of charter schools. I feel like a broken record when I say, once again, that education policy dilettantes like Leonhardt don't seem to understand that it requires more than a few studies showing a few charters in a few cities in a few select networks get marginally better outcomes on test scores to justify large-scale charter expansion.

There are serious cautions when it comes to the proliferation of "successful" charters, starting with the fiscal impact on hosting districts as charters expand. We should also be concerned about the abrogation of student and family rights, the lack of transparency in charter school governance, the narrowing of the curriculum in test-focused charters, the racially disparate disciplinary practices in "no excuses" charters, and the incentives in the current system that encourage bad behaviors.

But let's set all that aside and look at the evidence Leonhardt presents to justify his push for more charters:
Unlike most voucher programs, many charter-school systems are subject to rigorous evaluation and oversight. Local officials decide which charters can open and expand. Officials don’t get every decision right, but they are able to evaluate schools based on student progress and surveys of teachers and families. 
As a result, many charters have flourished, especially in places where traditional schools have struggled. This evidence comes from top academic researchers, studying a variety of places, including Washington, Boston, Denver, New Orleans, New York, Florida and Texas. The anecdotes about failed charters are real, but they’re not the norm.
You'll notice that Leonhardt picks cities and states that uphold his argument while excluding others like Detroit, Philadelphia, and Ohio. In addition: I spent a lot of time last year explaining why the vaunted Boston charter sector isn't all it's cracked up to be. I've also documented the mess that is Florida's charter sector. I'll try to get to some of Leonhardt's other examples, but for now: let's talk about Denver.

I'll admit it's one region where I haven't spent much time looking at the charter sector. Leonhardt links to a study that shows some significant gains for charters... although I have some serious qualms about the methodology used in the report. I'm working on something more formal which addresses the issue, but for now (and pardon the nerd talk): I am increasingly skeptical of charter effect research that uses instrumental variables estimators to pump up effect sizes. So far as I've seen, the validity arguments for its use are quite weak -- more to come.

For now, however, let's concede the Denver charter sector does, in fact, get some decent test score gains compared to the Denver Public Schools. The question, as always, is how they do it. Do they lengthen their school day and school year? If so, that's great, but we could do that in the public schools as well. Do they provide smaller class sizes and tutoring? Again, great, but why do we need schools that are not state actors to implement programs like that?

What we want to find are reasons that we can attribute only to the governance structure of charters -- not to resource differences, not to student population differences, but to the inherent characteristics of charters themselves.

And one thing I've found, time and again, is that one of the characteristics of "successful" charters is that they engage in patterns of significant student cohort attrition.


Let me explain what's going on here: this is data for the DSST network, one of the more lauded groups of charter schools in Denver. We're looking at the size of each graduating class as its cohort moved through the charter chain; in other words, how big the Class of 2014 was when its members were freshmen, then sophomores, then juniors, and then seniors. I've done the same with each class back to 2008.

See the pattern? As DSST student classes pass through the charter schools year to year, the number of students enrolled shrinks considerably. The Class of 2014, for example, was only 62 percent as large as seniors as it was as freshmen. Across the eight years on the graph, the senior classes range from 61 to 73 percent of their freshman size.
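
If you want to run this kind of check yourself, the computation is simple once you have grade-by-grade enrollment counts for a single cohort. The counts below are made-up placeholders, not DSST's actual figures:

```python
# Cohort survival from grade-level enrollment counts. These counts are placeholders
# for illustration; the real analysis uses Colorado's published enrollment files.
cohort = {"Grade 9": 400, "Grade 10": 350, "Grade 11": 300, "Grade 12": 250}

freshman_size = cohort["Grade 9"]
for grade, count in cohort.items():
    print(f"{grade}: {count / freshman_size:.0%} of the freshman class")
# A senior class at ~62% of its freshman size means roughly 4 in 10 students
# left the cohort somewhere along the way.
```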

Where do the kids who leave go? Many likely go back to the Denver Public Schools. Some of those likely drop out, which counts against DPS's graduation rate -- but not the charter schools'. In any case, they aren't being replaced, which I find odd considering how supposedly "popular" charters are.

Some make the case that the larger freshman classes are due to retention: the schools keep the kids for an extra year to "catch them up." Which I suppose is possible... but it raises a host of questions. Do public school students have the same opportunities to repeat a grade? Are the taxpayers aware they are paying for this? Why is there still significant attrition between Grade 10 and Grade 11?

Let's look at some other Denver charters and their cohort attrition patterns. Here's KIPP, the esteemed national charter network:


They haven't been running high schools as long as DSST, but the patterns are similar. KIPP's history is as a middle school provider; here are their attrition patterns in the earlier grades:



KIPP's Grade 8 cohorts shrink to between 73 and 84 percent of their Grade 5 size. Again: if they're so popular and have such long wait lists -- and if the DPS schools are so bad -- why aren't they backfilling their enrollments? Note too that much of the attrition comes after Grade 6. Most Denver elementary schools enroll Grades K to 5, so it doesn't appear that many students come into KIPP planning to move on after only one year; most of the attrition is in the later grades. Why would kids be leaving in the middle of their middle school experience?

Another middle school provider moving into high school is STRIVE:


Grade 8 is between 56 and 80 percent of the size of Grade 6. Let's look at one more: Wyatt Academy.


The last class we have data for shrank to 69 percent of its Grade 1 size by the time it got to Grade 8.

Let's be clear: cohort shrinkage occurs in DPS as well.


The last year for which we have data was an outlier: the Class of 2018 was 75 percent as big in Grade 8 as it was in Grade 5. For previous years, that figure ranges from 81 to 90 percent. The comparisons to the charters are admittedly tricky: the transition from Grade 5 to 6, for example, is sure to see students moving out of the area or into the private schools, both from DPS and the charters. 

But it's still striking to me that "popular" charters, which are allegedly turning away lottery losers, seem to lose more students proportionally than the "failing" DPS schools.



DPS has a large number of students leave their Grade 9 cohort before Grade 12. Many are dropouts, and that's a serious problem. But why does DPS get slammed for this while the charter high schools are declared "successful" even as they are losing at least as large a proportion of their students as the public high schools?

Again, this is tricky stuff. I'm certainly not going to declare that Denver's charter sector is getting all of its gains from pushing out the lower performers; we don't have nearly enough evidence to make that claim. But neither can we declare definitively, as Leonhardt does, that charter "...success doesn’t stem from skimming off the best." When you lose this many students, particularly in high school, you have to back up and take a more critical view of why some charters get the gains that they do.

One more thing: look at the y-axes on my graphs. The scale of Denver charter school enrollments is nothing like the scale found in DPS. Only recently has STRIVE come around to about 10 percent of DPS's enrollment per class. How can we be sure the gains they make, if any, can be sustained as the sector gets larger?

When charters shed this many kids, there has to be a system that catches them and enrolls them in school. A system that takes them at any time of year, no matter their background. A system that doesn't get to pick and choose which grades it will enroll and when. That system is the public schools; arguably, charters couldn't do what they do without it.

Before we declare charters an unqualified success, we ought to think carefully about whether factors like attrition play a part in helping them realize their test score gains, and what that means for the public school system.

I'll try to get to Denver more this summer. But let's get back to New Jersey next...

Saturday, April 29, 2017

Desperately Searching For the Merit Pay Fairy

It's been a while since we've talked about the Merit Pay Fairy.

Yo, it's me -- da Merit Pay Fairy, makin' all your reformy dreams come true!

The Merit Pay Fairy lives in the dreams and desires of a great many reform-types, who desperately want to believe that "performance incentives" for teachers will somehow magically improve efforts and, consequently, results in America's classrooms. Because, as we all know, too many teachers are just phoning it in -- which explains why a system of schooling that ranks and orders students continually fails to make all kids perform above average...

One of the arguments you'll hear from believers in the Merit Pay Fairy is that teaching needs to be made more like other jobs in the "real world." But pay tied directly to performance measures is actually quite rare in the private sector (p. 6). It's even rarer in professions where you are judged by the performance of others -- in this case, students, whose test scores vary widely based on factors having nothing to do with their teachers.

But that doesn't matter if you believe in the Merit Pay Fairy; all that counts is that some quick, cheap fix be brought in to show that we're doing all we can to improve public education without actually spending more money. And, yes, merit pay as conceived by many (if not most) in the "reform" world, is cheap -- because it involves not raising the overall compensation of the teaching corps, but taking money away from some teachers and giving it to others, using a noisy evaluation system incapable of making fine distinctions in teacher effectiveness.

Which brings us to the latest merit pay study, which has been getting a lot of press:
Student test scores have a modest but statistically significant improvement when an incentive pay plan is in place for their teachers, say researchers who analyzed findings from 44 primary studies between 1997 and 2016.
“Approximately 74 percent of the effect sizes recorded in our review were positive. The influence was relatively similar across the two subject areas, mathematics and English language arts,” said Matthew Springer, assistant professor of public policy and education at Vanderbilt’s Peabody College of Education and Human Development.
The academic increase is roughly equivalent to adding three weeks of learning to the school year, based on studies conducted in U.S. schools, and four weeks based on studies across the globe.
Let's start with the last paragraph first: the notion that you can translate this study's effects into "weeks of learning" is completely without... well, merit. Like so much other research in this field, the authors make the translation based on a paper by Hill et al. (2008). I'll save getting into the weeds for later (and in a more formal setting than this blog), but for now:

Hill et al. make their translation of effect sizes into time periods based on what are called vertically-scaled tests. These are tests that let at least some students attempt at least some common items across adjacent grade levels, allowing for a limited comparison between grades (see p.17 here).

There is no indication, however, that any of the tests used in any of the 44 studies are vertically scaled -- which makes a conversion into "x weeks of learning" an unvalidated use of test scores. In other words: the authors in no way show that their study can use the methods of Hill et al., because the tests are likely scaled differently.

Furthermore: do we have any idea if the tests used in international contexts are at all educationally equivalent to the tests here in the US? For that matter, what are the contexts for the teaching profession, and how it might be affected by merit pay, in other countries? So far as I'm concerned, the effect size we care about is the one found in studies conducted in this country.

That US effect size is reported in Table 3 (p. 44) as 0.035 standard deviations. How can we interpret this? Plugging into a standard deviation-to-percentiles calculator (here's one), we find this effect moves a student at the 50th percentile to roughly the 51.4th.* It's a very tough haul to argue that this is an educationally meaningful effect.
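
Here's that conversion, for anyone who wants to check it; it's just the normal distribution's cumulative function evaluated at the effect size (assuming scores are roughly normally distributed).

```python
# Converting a 0.035 standard deviation effect into percentiles: a student
# starting at the 50th percentile lands at roughly the 51.4th.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

print(f"{normal_cdf(0.035):.1%}")   # ~51.4%
```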

Which brings us to the next limitation of this meta-analysis: the treatment is not well defined. To their credit, the authors attempt to divide up the different studies by their characteristics, but they only do so for the pooled international sample. In other words: they report the differences between a merit pay plan that uses group incentives versus a "rank order tournament" (p. 45, Table 4), but they don't divide these studies up between the US and the rest of the world.

Interestingly, group incentives have a greater effect than individual competitions. But there is obviously huge variation within this category in how a merit pay plan will be implemented. For example: where did the funds for merit pay come from? 

In Newark, merit pay was implemented using funds dedicated by Mark Zuckerberg. Teachers were promised that up to $20 million would be available; of course, it turned out to be far less (and it's worth noting that there's scant evidence Newark's outcomes have improved). Would this program have different effects if the money had not come from an outside source?** What if the money came, instead, from other teachers' salaries (which may, in fact, be the case in Newark)?

Any large-scale merit pay plan will be subject to all sorts of variations that may (or may not) impact how teachers do their jobs. Look at the descriptions in Table 6 (p. 47), which recounts how various merit pay plans affect teacher recruitment and retention, to see just how diverse these schemes are.

I think it's safe to say that "merit pay" in the current conversation is not really about giving bonuses for working in hard-to-staff assignments, or for taking on extra responsibilities, or even for working in a group that meets a particular goal. I'm not suggesting we shouldn't be looking at the effects of programs like this, but I don't think it's helpful to put them into the same category as "merit pay."

I think, instead, that "merit pay" is commonly understood as being a system of compensation that differs from how we currently pay teachers: one where pay raises are based on individual performance instead of experience or credentials. The Chalkbeat article certainly implies this by making this comparison:
Teacher pay is significant because salaries account for nearly 60 percent of school expenses nationwide, and research is clear that teachers matter more to student achievement than any other aspect of schooling (although out-of-school factors matter more). About 95 percent of public school districts set teacher pay based on years of experience and highest degree earned, but merit pay advocates argue that the approach needs to change. [emphasis mine]
Take a look at a sample of articles on teacher merit pay -- here, here, here, here, and here for example -- and you'll see merit pay contrasted with step guides that increase pay for more years of experience or higher degrees. You'll also notice none of the proponents of merit pay are suggesting that the overall amount spent on our teaching corps should increase.

I can understand the point of writers like Matt Barnum who argue that merit pay can come in all sorts of flavors. But I contend we're not talking about things like hard-to-staff bonuses or group incentives: When America debates merit pay, it's really discussing whether we should take pay from some teachers and give it to others.

Unfortunately, by analyzing all of these different types of studies together, the Vanderbilt meta-analysis isn't answering the central question: should we ditch step guides and move to a performance based system? That said, the study may still be giving us a clue: the payoff will likely be, at best, a meager increase in test scores.

Of course, we have to weigh that against the cost -- or, more precisely, the risk. Radically changing how teachers are paid would create huge upheavals throughout the profession. Would teachers who were in their current assignments stay on their guides, or would they potentially take huge hits in pay? If they were grandfathered out of a merit pay scheme, how would they work with new teachers who were being compensated differently?

Would merit pay be doled out on the basis of test scores? How much would VAMs or SGPs be weighted? How would teachers of non-tested subjects be eligible? Would the recipients of merit pay be publicly announced? In New Jersey and many other states, teacher salaries are public information. Would that continue? And how, then, would students be assigned to the teachers who receive merit pay? Will parents get to appeal if their child is assigned to a "merit-less" teacher?

The chaos that would result from implementing an actual merit pay plan is a very high cost for a potential 0.035 standard deviation improvement in test scores.

I know believers in the Merit Pay Fairy would like to think otherwise, but clapping harder just isn't going to make these very real issues go away.

Don't listen to dat Jazzman guy! Just clap harder, ya bums!


ADDING: More from Peter Greene:
Researchers' fondness for describing learning in units of years, weeks, or days is great example of how far removed this stuff is from the actual experience of actual live humans in actual classrooms, where learning is not a featureless tofu-like slab from which we slice an equal, qualitatively-identical serving every day. In short, measuring "learning" in days, weeks, or months is absurd. As absurd as applying the same measure to researchers and claiming, for instance, that I can see that Springer's paper represents three more weeks of research than less-accomplished research papers.
Heh.


* Some folks don't much care for making this kind of conversion. In my view, it's much more defensible than converting to "x weeks of learning," which, even setting aside the problem that the tests aren't vertically scaled, suffers from unjustified precision. In addition, the implications behind the translation are subject to wild misinterpretation.

Converting to percentiles might be a bit problematic. But it's not nearly as bad as using "x weeks of learning."

** We'll never know because no one has bothered to find out if the Newark merit pay program actually worked. Think about it: $100 million in Facebook money, and no one ever considered that maybe reserving a few thousand for a program evaluation was a good idea.

If I was cynical, I might even think folks didn't want to study the results, because they were afraid of what they might find. Good thing I'm not cynical...

Monday, April 10, 2017

Teacher Tenure and Seniority Lawsuits: A Failure of Logic

New Jersey's teacher tenure and seniority lawsuit continues to grind away. Part of a trio of suits here and in New York and Minnesota, these lawsuits are all being brought to the various state courts by the Partnership for Educational Justice, Campbell Brown's secretly funded organization.

Their Minnesota lawsuit was thrown out of court last fall; in New Jersey, however, we had to wait for a state Supreme Court ruling on a Christie administration motion to tie tenure and seniority laws to school funding. The Court ruled it wasn't going to opine on these laws until a lower court takes up the PEJ's case. So now we wait for that ruling -- and the PEJ continues its public relations campaign against tenure and last-in, first-out (LIFO) seniority rules.

To their credit, PEJ has posted all of the filings in the case. But it's clear after reviewing them that PEJ doesn't have a leg to stand on. Not to say they won't prevail: bad legal reasoning didn't stop Judge Rolf Treu in California from issuing a terrible ruling in Vergara, which was inevitably overturned on appeal. Similarly, the only way PEJ can win here in New Jersey is if the lower court hearing the case sets aside all logic and reason...

Because the PEJ's case simply makes no sense.

When a group like the PEJ goes before the courts to get a statute overturned as unconstitutional -- by which I mean in violation of the state's constitution, not the federal Constitution -- the burden of proof is on them. They may have a problem with the NJ tenure and LIFO statutes, or any other law on the books, but getting the court to overturn a law isn't simply a case of arguing against the law's merits: they have to show how it violates the state's constitution.

The constitution states (Article VIII, Section IV): "The Legislature shall provide for the maintenance and support of a thorough and efficient system of free public schools for the instruction of all the children in the State between the ages of five and eighteen years." Unless and until the PEJ can demonstrate to the courts that tenure and LIFO laws violate this clause, the courts cannot act.

In a long-running series of cases involving school funding in New Jersey, the NJ Supreme Court found the systemically inadequate and inequitable funding of schools was in direct violation of the education clause. Although the litigation has a long and complex history, the basic premise of the lawsuits is comparatively simple: at-risk children need more funding to equalize educational opportunities, the state's system of school funding makes it impossible for those children's communities to raise adequate funds on their own and, therefore, the state needs to intervene.

In contrast, the challenge for the PEJ is to show how tenure and LIFO laws similarly violate the education clause; even the PEJ's own filing concedes this point. The problem is that right after stating what their legal argument should be, they completely ignore the task. 

Yes, they make the case districts with larger proportions of at-risk children show fewer gains in academic outcomes; no one disputes this. Yes, they make the case teachers matter; no one disputes this (although the canard of teachers being "the most important in-school factor" for student achievement is wrong: the student is the most important "in-school factor," not the teacher). Yes, they make the case ineffective teachers should be dismissed; no one disputes this.

They even go further and argue that the quality of teachers suffers in districts that serve many at-risk students. Certainly, there's strong evidence students in these districts are more likely to have less qualified teachers, as judged by their credentials, experience, or scores on knowledge tests (teacher scores, not students). 

But none of this speaks to the central argument PEJ is trying to make:
Hill: The Newark Teachers Union says — about the comment like that — these folks who are tenured, they’ve been through a certain process and if the process determines that they’re no longer an effective teacher, the process has a way of dealing with them. You say? 
[Ralia] Polechronis [PEJ Executive Director]: So, that’s not entirely what we are talking about here. What we’re talking about in LIFO are terminations and layoffs that have to happen only during budget cuts. So that process, that dismissal process, isn’t really at play. We’re talking about a situation when the district is in such dire financial constraints and is having such a problem figuring out its budget that they have to go to teachers, they have to go to laying them off and they have to make that decision, according to the law, by the level of seniority instead of thinking about the great teachers that are in the classroom and that should stay there.
Think about what Polechronis is assuming: that a district like Newark has the ability to differentiate at a very fine level the effectiveness of individual teachers, and then act accordingly in high-stakes decisions.

Let's be very clear: There is no evidence -- none -- that teacher effectiveness can be measured reliably and validly at a level that allows for high-stakes decisions to be made regarding teachers who have already been found to meet a minimal level of effectiveness.

What PEJ argues implicitly is that the Newark Public Schools can simply use its observation rubrics and Student Growth Percentiles and Student Growth Objectives to calculate an overall measure of teacher effectiveness, and then apply that measure to fairly determine who gets the boot when budgets cuts "must" be made. But this contradicts everything we know about measuring teacher effectiveness.

Yes, principals can identify their very worst teachers; they are incapable, however, of differentiating the effectiveness of the vast bulk of teachers in the middle. The phony precision of observation protocols like the Danielson Model has led some to think we can validly use the resulting scores to accurately rank and order teachers; that is a mistaken belief grounded in innumeracy. In the same way, the error that is an inherent part of standardized tests makes the use of SGPs in decisions like this invalid (among many other reasons). And SGOs are, to be blunt, a joke.

The plain truth is that even if PEJ got its way and teachers could be dismissed without regard to seniority, there is no reliable and valid way to evaluate the majority of teachers who are dismissed in reductions-in-force. Yes, we can identify the worst performers; we can and should either get them remediation or remove them from their classrooms. But there's simply no reason to believe Newark, or any district, can accurately rank all teachers by their effectiveness.

But that's not the only failure of logic in PEJ's case. Because even if districts could make accurate decisions based on effectiveness -- again, they can't, but play along -- PEJ would still have to show that districts like Newark were disproportionately affected by LIFO laws.

Unlike school funding -- which, despite all of the lawsuits, is still inequitably distributed across the state -- tenure and LIFO laws apply to every district equally. Newark and more affluent Millburn both have to operate under tenure laws; Camden and more affluent Haddonfield both have LIFO. Yes, the cities have had to make cuts in staff, in large part because charter schools, imposed by the state, have gobbled up more students and more resources. But that's not a function of tenure or LIFO laws; how could it be?

Reading through the PEJ's filings, it's clear they are unable to make a case that urban students have suffered disproportionately by tenure; in fact, as NJEA points out in one of its briefs, there isn't even evidence that any of the plaintiffs' children suffered from having a bad teacher who was spared dismissal by the LIFO laws, calling into question the plaintiffs' standing.

What is clear is that Newark's schools have suffered from inadequate and inequitable funding; even the plaintiffs acknowledge students have suffered from losses of staff like librarians and guidance counselors (p.9-10). But they put forward no argument that removing LIFO laws would have saved those jobs; again, how could they?

Some have argued that dismissing senior, higher-paid employees frees up more funds for lower-paid, less senior staff, thus leading to fewer reductions. This assumes that teacher effectiveness is evenly distributed across experience, which we know is not true -- when you cut experienced teachers, you're more likely to cut effective teachers (and again: we're setting aside the problem that you can't rank and order the vast majority of teachers by effectiveness anyway).

It also assumes that there is so much inefficiency within urban schools that they can cut staff and retain programming and class size. Empirically, however, we know that NJ's urban schools are not systemically its most inefficient ones. We also know that funding adequacy correlates with staff per student in various educational programs, which means the problems of cutting staff and programming have much more to do with inadequate funding than they do with tenure and LIFO -- policies, again, which are enforced in all districts.

Finally, it's important to remember that teachers value tenure and LIFO. If the state gets rid of them, that decreases the overall compensation, monetary and otherwise, of teachers. Are the taxpayers of New Jersey willing to fork over more money to make up for this loss in incentives? Or do they want to see a less qualified pool of prospective teachers enter the profession?

The backers of these lawsuits will make occasional concessions to the idea that schools need adequate and equitable funding to attract qualified people into teaching. But they never seem to be interested in underwriting lawsuits that would get districts like Newark the funds they need to improve both the compensation and the working conditions of teachers.

Instead, they waste their time with lawsuits like this -- suits that fail on legal, empirical, and logical grounds. Suits that do nothing to help deliver the resources all students need to equalize educational opportunities. Suits that do nothing to improve the effectiveness of New Jersey's teaching corps, or the efficiency of its school system. Suits that only serve to further dishearten the people who go to work in public schools every day on behalf of the taxpayers and students of this state.

Maybe one day Campbell Brown and the PEJ will stop trying to take away the hard-fought rights of teachers, and take up the real fight for our state's deserving children.



ADDING: As if on cue:
ATLANTIC CITY — The school district advertised three times for a certified chemistry teacher last summer and fall, and three times they failed to get a candidate to accept the job.
So they turned to Edmentum, a provider of online courses, to fill the gap. This year, four classes at the high school are being taught via the online course, with backup support from a teacher.
[...] 
The statewide shortage makes the position competitive. At least three area school districts are looking for chemistry teachers next year.  
Ralph Aiello, principal at Cumberland Regional High School, said he’s looking for a combined chemistry/physics teacher for next year. So far, he has had just two applications. 
Linda Smith, president of the New Jersey Science Teachers Association, said she is working with colleges to develop programs that recruit former or retired scientists into teaching as a second career.
“People can just make more money as scientists than they can as science teachers,” she said. “Some do want to teach. But they need training and mentoring. People who are good at science are not always good at explaining it.” [emphasis mine]
Terry Moe, hardly a friend of teachers unions, states: "...most teachers see the security of tenure as being worth tens of thousands of dollars a year.” So please, PEJ: Explain to us how eliminating tenure and LIFO will help recruit better candidates into a profession that is already suffering from serious shortages.

(This should be good...)