When companies that sell instructional software used to come calling on Reid Lyon, expert on reading instruction and former advisor to President Bush, he played a little game. First, he listened politely to the sales reps’ enthusiastic pitches and colorful demonstrations of how computer software can build reading skills in new ways. Then he asked to see their technical manuals.
“I always found nothing in there that would help the consumer determine if this stuff really works,” said Lyon, who last year ended a 10-year stint as chief of the Child Development and Behavior Branch at the National Institute of Child Health and Human Development, which sponsors studies on reading. Regarding software, Lyon said, he would “rarely see any data that I would consider credible.” These encounters with software makers happened so often—three to four times a month in Lyon’s experience—that he observed a pattern. “They always came in excited, and left very depressed.”
Educational software makers may get rebuffed by authorities like Lyon, whose endorsements, companies believed, could lead to governmental stamps of approval, and thus explosive sales. But they usually get warmer receptions in the offices of the nation’s school superintendents, who are, after all, their primary customers. The system was not supposed to work this way. President Bush’s No Child Left Behind Act (NCLB) famously requires that any instructional materials supported by federal aid be proven to work through “scientifically based research.” Unfortunately, scientific proof is defined in many ways. In the world of education software, definitions abound.
Wrestling with Slippery Evidence
According to the Institute of Education Sciences (IES), the primary overseer of research within the Department of Education, “scientifically based research” fits the following criteria: It randomly assigns its test subjects to comparable groups; it yields reliable, measurable data; if the study makes any claims about what causes its effects, it “substantially eliminates plausible competing explanations”; its methods are clear enough that other researchers can repeat or extend them; and, finally, the study has been accepted by a peer-reviewed journal or equivalent panel of “independent experts.”
But not everyone reads bulletins from the Department of Education, or interprets them the same way. Purveyors of education products of all kinds make claims that they’re based on scientific proof, and thus “aligned” to federal requirements. A good many trot out studies, sometimes great numbers of them, that appear to have followed one or more steps that are hallmarks of gold standard scientific research. A select few software programs (such as Cognitive Tutor, an unusually sophisticated math program, and Fast ForWord, a language program) have gone through at least some careful vetting. But the vast majority fall at the other end of the scale. As Lyon puts it, claims “are based on any kind of document—whether it’s an unpublished technical manual, an opinion piece, or an editorial. These people either don’t understand the law’s requirements, or they’re trying to game the system.”
Federal authorities were supposed to lend districts a helping evaluative hand. This was to be done primarily through a project funded by the IES called the What Works Clearinghouse (WWC). Created in 2002, the WWC gathered a team of top-flight research experts whose mission was to review studies done on a range of instructional packages—both traditional and electronic—and rate the quality of their achievement data (see Figure 1). After four years and an expenditure of $23 million, the WWC had evaluated studies on only 32 products. While the WWC itself earned good ratings for the rigor of its work, plenty of people were frustrated with how little was getting done—and how few studies met the agency’s standards. After complaints mounted, the WWC sped up its work. By December 2006, it had reviews out on 51 products. To accomplish this, the WWC went through 255 studies. Still, the vast majority of studies (75 percent) did not meet the agency’s scientific standards, even with some “reservations.”
By this time, critics were calling the WWC “the Nothing Works Clearinghouse.” The nickname carries an important double meaning. To some, it’s another example of governmental blockheadedness—specifically, that understanding how teaching and learning work in the real world is beyond the skill of a federal agency. To others, including many leaders in the research community, the message is actually harsher: most new classroom gimmicks don’t add much of value, and studies packaged to suggest otherwise are to be treated with great suspicion. In fairness, suspicious research sometimes contains perfectly innocent flaws. That’s because truly scientific research is extremely difficult, time-consuming, and costly—and thus very rare—which is precisely why the WWC has found so few studies to be satisfactory.
To compensate for the WWC’s academic outlook and pace, many other organizations, both private and governmental, have developed their own evaluation systems to help schools navigate the dizzying array of curricular products on the market today. (These include a RAND Corporation resource, called the Promising Practices Network; the Comprehensive School Reform Quality Center, from American Institutes for Research; the Best Evidence Encyclopedia, out of Johns Hopkins University; and even a global survey called the International Campbell Collaboration.) While some offer useful information, their criteria and standards vary widely. And this may further confuse, or mislead, school purchasing agents.
The IES has since 2003 been working on its own evaluation of educational software, through “gold standard” methods of scientific research. The ambitious $15 million study, due sometime early in 2007, has some peculiar characteristics. IES did not begin by selecting the most popular products, but instead asked software producers to volunteer; it then chose 15 products from among those that did. And while the evaluation methods of IES appear to have been exacting, the study will answer nothing more than the most general question: Does educational software, as a class, tend to work? The individual evaluations of the 15 packages will not be released.
This is odd for two reasons. First, the basic question about software’s general effectiveness has long been answered: as a whole, it works no better than cheaper traditional materials. (Former North Carolina State professor Thomas Russell watched so many studies come to this conclusion over the years that he eventually compiled a book on the subject. Covering 355 different studies done since the early 1900s, the book was titled, aptly, The No Significant Difference Phenomenon.) The more precise answer is, it depends on which software you’re using, with what ages, and in what circumstances. But if you are a teacher or administrator trying to make a shopping decision, “this [study] isn’t going to help you,” admits the IES study’s lead researcher, Mark Dynarski of Mathematica Policy Research, Inc., in Princeton, New Jersey. “We’re trying to help Congress, which is spending more than $700 million a year to support technology.” As basic as the general answer may be to education insiders, Dynarski believes it will be news to policymakers. “If it is this difficult to know whether this stuff helps, why is everyone so anxious to know whether to purchase it?”
The second oddity to the IES arrangement is the bargain it struck. To be included in the study, companies had to donate software for 132 schools and teacher training. In return, they get two important gifts: a free study, complete with a federal stamp of approval, and the study’s individual evaluations. Companies can package and spin those evaluations however they like, since no one besides IES will have those details. “I don’t know what they will do with the data,” Dynarski said.
All of which raises a very large and thorny question: What really does happen, on the ground, inside the schools, when research spin and marketing hype collide with desperate classrooms?
Money, Money Everywhere
Software sales to schools have certainly been robust. According to Simba Information, a media analysis firm based in Stamford, Connecticut, the nation’s K–12 schools bought $1.9 billion of electronic curricular products in 2006. While that is less than a fourth of the instructional materials market as a whole, the electronic sector’s growth has been vigorous—up 4.4 percent from 2005 to 2006, as compared with the 2.6 percent growth rate of the overall instructional products market.
Among the many electronic products that schools buy, the most visible have been those geared toward reading. Building up the nation’s reading skills was of course the main impetus behind No Child Left Behind. Today, schools can pick from a cornucopia of federally funded initiatives in this domain. There is Reading First, which aims $1 billion a year at grades K–3; various funds for “supplemental” products; special programs to promote educational technology or “comprehensive school reform”; and numerous initiatives under Title I, the overarching federal fund for poor students.
This plethora of options—and money—has produced an abundance of new marketing opportunities for software companies. Strangely, one of the richest of those is NCLB’s scientifically based research requirement. Conceived as a strict rule with clear methodological standards, it has instead become a versatile tool—a Cuisinart for raw statistics, ideal for marketing hustle and deception. It has not helped matters that, as with many laws, its crucial details are beyond the grasp of those it affects most; nor did the law’s creators endow it with any sort of enforcement system.
L.A.’s $50 Million Gamble
Consider the story of Waterford Early Reading, distributed by Pearson Digital Learning. Pearson is the nation’s leading seller of educational software, and its Waterford program is used in 13,000 classrooms in all 50 states. The WWC has not yet evaluated Waterford, but it is one of the 15 products that IES has elected to study.
So what could be known about products like Waterford if government evaluators chose to look into and report on the daily experiences of teachers and students? One particular school district, Los Angeles Unified (LAUSD), has a long and remarkably troubled history with this product. In July of 2001, LAUSD decided to spend nearly $50 million on Waterford, instantly making itself the company’s largest customer.
Waterford is designed for the earliest readers, students in grades K–2. It requires students to spend 15 to 30 minutes a day with various multimedia exercises, and costs $200 to $500 per student. In launching the program, Roy Romer, then LAUSD superintendent, said this “is like putting a turbocharger in a car engine. We are going to accelerate reading performance in Kindergarten and first grades.”
Several years later, the district’s own evaluation unit pronounced the program a failure. In a 2004 report, its second with negative findings, the evaluators said, “There were no statistically significant differences on reading assessments between students who were exposed to the courseware and comparable students who were not exposed to the courseware.”
Some teachers found the program helpful, but many did not. Pearson and other supporters of the program argued that Waterford’s effectiveness was compromised by the fact that teachers didn’t fully use the product—and when they did, they often used it incorrectly. But L.A.’s last evaluation found that “neither the amount of usage nor the level of engagement had an impact on achievement.”
The report, which did not make the news until early 2005, stunned L.A. school officials. Even Romer backtracked. “As I looked at this, it didn’t provide as much bang for the buck as I would have liked,” he told the Los Angeles Times. The district has since scaled back the Waterford program, using it as more of a sideline specifically for students with learning difficulties. (One problem was that Waterford was taking time away from students’ primary literacy lessons, thereby causing actual declines in achievement.) Teachers and administrators both say sidelining Waterford has helped. But it also means that the district is getting a lot less for its $50 million than it planned on. School board members soon questioned the wisdom of the whole venture.
What happened here? And what lessons do Waterford’s rise and partial fall in Los Angeles offer education policymakers, not only in other states but also in Washington?
The first lesson is to beware of seemingly persuasive numbers. Many curriculum producers started promoting their “scientific” research very soon after NCLB required it—an impossible feat, considering the many years it takes to conduct solid scientific studies.
How does questionable research get produced? In Waterford’s case in Los Angeles, Julie Slayton, an analyst in LAUSD’s Program Evaluation and Research Branch and one of the authors of the Waterford evaluations, says Pearson did not try to tilt the evaluators’ basic data, “but they did their best to make the report come out favorably. They tried to make it focus on implementation instead of effectiveness.” In other words, Pearson wanted the question to be about the district’s wobbly use of Waterford—not whether the program itself inherently worked.
After failing to change the district’s opinion, Pearson prepared a preliminary evaluation of its own—a “briefing packet” for Superintendent Romer, full of numbers indicating that Waterford was producing dramatic achievement gains. Ted Bartell, director of research for LAUSD, was not pleased. In a memorandum dated May 8, 2002, he urged Pearson to spell out “the methodological limitations” of its studies. Bartell argued that Pearson’s sample sizes (60 students) were too small for solid conclusions to be drawn from them; that there was no evidence that Waterford and not other factors caused the gains; and that the gains were too small to be meaningful in any case, or to be representative of the district as a whole.
In the following months, Pearson continued to generate numbers indicating success, but the data drew from problematic sources. Some used the Academic Performance Index (API), which covers all grades in a school; this confounded the picture of K–2, where Waterford was used. Some involved the California English Language Development Test (CELDT), which is meant only for non-English speakers and tests advanced literacy skills that Waterford doesn’t directly teach. Others drew from reading inventories created by Pearson itself. “They kept using gains and measures that are not relevant to their program,” says Lorena Llosa, a former LAUSD research analyst who was completing a doctorate in applied linguistics using CELDT data.
In retrospect, Slayton says, “Pearson does an enormously aggressive job. They pressured us. They are in your face. They are obnoxious, and they don’t go away. They are very, very good at giving people a show.”
Before Los Angeles invested in Waterford, there were plenty of signs that the program might not work quite as the company promised. For years, Pearson, like many companies, had been gathering studies that offered evidence of Waterford’s effectiveness. But various independent evaluators had pointed out that the bulk of these studies have methodological problems (lack of control groups, small sample sizes, missing information, numbers based on subjective survey data, and so forth). “In light of these limitations,” the LAUSD evaluators said in their first report, in 2002, “much of the information from these evaluations should be interpreted with caution.”
Why didn’t Los Angeles officials do that? One reason is that researchers who evaluate classroom exercises and educators who work inside those classrooms represent two often conflicting cultures—this is lesson Number Two. As an illustration, when Ronni Ephraim, the district’s chief instruction officer, was asked if she had looked at the research on Waterford before supporting its use in LAUSD, she said she had not, largely because it did not seem relevant. “Every classroom situation is different,” she says. “And nothing compares to L.A. I’d rather listen to my own teachers.” Ephraim’s worldview is broadly shared. To the average administrator, the sensations of success or failure inside your own classrooms are going to feel a lot more relevant than abstract statistics drawn from schools on the other side of the country. Is it any wonder, then, that NCLB’s scientific research requirements have been so widely ignored?
To researchers, however, Ephraim’s way of thinking can make an instruction method look like it’s working when it’s not. All too often, some other environmental factor is driving the improvement; sometimes, in fact, the gains are just normal growth associated with getting older. For reassurance that none of this is the case, researchers commonly begin by looking for two facts in particular: first, studies of the program published in independent, refereed journals—ideally those of high repute; and second, the researchers’ use of truly comparable groups.
So far, very few commercial programs meet these standards—although many claim they do. One such example is Renaissance Learning, Inc., whose lead product, Accelerated Reader (AR), is used in more than half the nation’s public schools. Renaissance has built such a following behind AR that it regularly holds massive annual conferences that feel like religious revival meetings. Testimonials at these conferences are typically adorned with lengthy, seemingly solid studies proving AR’s power. Yet none of these studies have held up to serious scrutiny. “These studies all suffer from serious confounds or design problems that make it impossible to show that AR improves reading,” says Tim Shanahan, professor of urban education at the University of Illinois, Chicago, and director of its Center for Literacy. Shanahan also was a member of the National Reading Panel, the group of reading experts chosen by congressional mandate that published an exacting report in 2000 evaluating decades of research on reading instruction.
Other commercial packages, both computerized and traditional, have not fared much better. In the What Works Clearinghouse ratings, not a single product has more than one study fully meeting WWC research standards. This includes two well-respected software packages: I CAN Learn and Cognitive Tutor. It should be noted, however, that rigorous research is no guarantee of product effectiveness. Fast ForWord, for example, had one study that cleared the WWC’s top bar. That study found that while Fast ForWord was effective with language development, it was ineffective with reading achievement.
Although Waterford has yet to be reviewed by the WWC, it has been treated to four evaluations published in peer-reviewed journals. The first was a 2002 study in the Journal of Experimental Child Psychology, which found Waterford to be helpful in only one of nine areas the researchers tested. The second, in 2003, was a mixed evaluation in Reading Research Quarterly, the journal of the International Reading Association. Why the ambivalence? “The more careful the study, the more mixed the results. That’s the punch line,” explains Tracy Gray, an expert on educational technology with American Institutes for Research.
The next two studies—a 2004 evaluation in the Journal of Literacy Research (JLR), the publication of the National Reading Conference, and a 2005 article in Reading & Writing Quarterly—were both positive. But it’s not clear these were full or fair contests. Neither study says much about what the non-Waterford students did while their peers played with the new computers; typically, the “control group” gets nothing of comparable novelty or potential power. The 2004 study, Lyon says, suffers from “fatal flaws.” And the 2005 study used unusually small test pools: 46 students, who were divided into four even smaller sub-groups. Three of these four, including a non-Waterford group, all made gains. So the study’s praise for Waterford rests on the comparatively poor showing of one small group: 12 low-performing 1st graders who did not use Waterford. “That’s promising, but it’s not enough evidence to justify spending a lot of money,” says Shanahan.
Pinning Down the Truth
These varied interpretations illustrate yet another lesson: Experts don’t all agree on what constitutes good research. As an example, Wayne Linek, professor of education at Texas A&M University, in Commerce, Texas, and co-editor of the JLR, finds Lyon’s view too narrow. True experimental designs that compare treatment groups to untreated “control” groups can turn students into “guinea pigs,” which Linek considers unethical. Furthermore, Linek says, these kinds of studies can lean on quantitative measures—frequency of eye movements, for example, or recognition of certain letters—that are suspect and thus “bad predictors.”
Linek is partly right to complain about the current obsession with measurable data, which is a double-edged sword. On the one hand, the data are critical—they allow other researchers to replicate the original study’s findings, or take them further. On the other hand, any study that looks for data in obscure factors like eye movements can be justly criticized for missing the main event, despite the fact that it qualifies for publication in any number of relatively credible journals.
Linek’s first point raises an even more important question: What’s wrong with turning students into guinea pigs, anyway? Medical researchers do this with people all the time. If mistakes are going to be made, isn’t it more humane to make them with a small number of test subjects than with the general population?
While the research community debates such questions, the commercial sector has felt free to devise its own interpretations. Andy Myers, chief operations officer for Pearson Digital Learning, takes a sunny view of Waterford’s scientific grounding. “In our experience,” he says, “the more in-depth the evaluation process is, the more the Waterford program shines.” When asked about the weaknesses that Lyon and others see in the Waterford studies, Myers acknowledged the studies’ limitations. Then, echoing Ephraim, L.A.’s chief instruction officer, Myers questioned the relevance of the research itself. “The studies may or may not meet the rigorous standards of the What Works Clearinghouse,” he said. “But what’s more important to a district is, ‘Does it work with our students?’ Then they’re going to want to expand it.” When a prominent company like Pearson dismisses the quantitative research, it exploits the fact that teachers and their administrators don’t understand it, and excuses them for disregarding it.
If the education world has been living in a kind of scientific denial, that era may be fast drawing to a close. At this point, the advocates of classic, quantitative science clearly have the ear of the Bush administration. And they are setting today’s education standards. With its next round of software research, IES will release individual evaluations. It is also expanding its evaluations to include textbooks. Eventually, a complete list of governmentally approved product ratings will be just a mouse click away. That means that someday, it will be easy for the marketplace—namely, district superintendents and their purchasing agents—to embrace high scientific standards. When that time comes, the spoils won’t go to companies that have busied themselves ginning up studies full of tilted numbers. The victors will instead be those who practice true R&D—that is, companies that dare to use these intervening years of confusion to subject their materials to research that is both rigorous and independent.
Todd Oppenheimer is the author of The Flickering Mind: Saving Education from the False Promise of Technology, which was a finalist for the 2003 book award from Investigative Reporters & Editors. He can be reached at www.flickeringmind.net.