
Researchers have finally settled the debate about whether social-emotional learning, or SEL, boosts test scores. At least, that’s what recent news reports would have you believe. Outlets like EdWeek and NPR report that SEL can raise achievement by 4 to 8 percentile points, citing new data as “clear evidence” that SEL programs lead to better grades and test performance. These claims all trace back to a new meta-analysis from USC and Yale that has already shaped national coverage. But before policymakers take these estimates at face value, it’s worth looking closely at how the study was built.
The press release promoting the study asserts that “universal SEL programs are a sound investment in education systems worldwide,” that we now have “rigorous scientific evidence” that SEL improves both student well-being and academic achievement, and that SEL programs “should not be viewed as add-ons” but as essential components of schooling. Those are substantial claims. Does the underlying evidence support them?
The authors set out to determine whether SEL programs improve academic achievement among K–12 students and whether effects differ by grade span, subject area, type of outcome measure, or program duration. To do this, they searched for studies of universal SEL interventions conducted anywhere in the world and published between 2008 and 2020. The operational definition they rely upon is this: “Universal school-based social and emotional learning interventions support the development of intra- and interpersonal skills to promote physical and psychological health for all students in a given school or grade, including fostering the development of emotional intelligence, healthy behavior regulation, identity formation, and the skills necessary for establishing and maintaining supportive relationships and making empathic and equitable decisions in the best interest of the school community.”
Their final sample included 40 studies of 30 different programs—ranging from mainstream SEL curricula developed and implemented in the U.S. to interventions as varied as Tai Chi, yoga, and “The Little Prince Is Depressed” (a program developed in Hong Kong to prevent depression among Chinese adolescents). Seven of the 40 studies were unpublished dissertations. Of the full set, 29 studies used randomized controlled trials, and about one-quarter of the evidence base was unpublished.
The authors say their review offers important guidance for SEL decision-making. A closer look reveals reasons to be cautious about the very confident conclusions now circulating in the media.
Meta-analysis is powerful when the studies being synthesized are truly comparable, but that assumption is strained here. The included SEL programs differ in intensity, duration, purpose, and instructional design; the achievement measures differ in how they’re scored and how much confidence we can place in those scores; and teacher training ranges from minutes to multi-day workshops with ongoing coaching. Some studies rely on random assignment, while others do not. And in one case, the review treats different analyses of the same data as if they were separate findings. These limitations make the pooled effect size the authors report hard to interpret and suggest that claims about test-score gains may be premature.
How Comparable Are the Outcome Measures?
These 40 studies also differ substantially in how they measure academic achievement. Many do not use standardized tests or course grades but rely on coarse rating scales. A five-point scale, such as the one used in the Dowling et al. (2019) study, provides only a rudimentary signal of differences between students. With so few score options, small improvements are hard to detect, and students “top out” quickly: once they reach the highest score, no further growth can register. Such a scale also depends on teachers rating students consistently.
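To see how a coarse scale can hide real differences, consider a small simulation, purely hypothetical and not drawn from any of the included studies: two groups with genuinely different achievement are rounded onto a 1-to-5 teacher rating, and part of the gap disappears at the top of the scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" achievement for a control and a treatment group;
# the treatment group is shifted up by a modest amount. Illustration only.
control = rng.normal(loc=4.2, scale=0.8, size=1000)
treatment = rng.normal(loc=4.5, scale=0.8, size=1000)

def to_five_point(scores):
    """Round continuous achievement onto a 1-to-5 teacher rating scale."""
    return np.clip(np.round(scores), 1, 5)

true_gap = treatment.mean() - control.mean()
rated_gap = to_five_point(treatment).mean() - to_five_point(control).mean()
ceiling_share = (to_five_point(treatment) == 5).mean()

print(f"true difference in means:        {true_gap:.2f}")
print(f"difference on the 5-point scale: {rated_gap:.2f}")
print(f"treatment students at the top:   {ceiling_share:.0%}")
```

Even in this toy example, some of the treatment group's growth has nowhere to go once ratings hit the ceiling, and real teacher ratings add further noise from inconsistent judgments.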
Similarly, Fishbein et al. (2016) rely on a four-point teacher rating of students’ general academic standing. Muratori et al. (2016) use a ten-point scale derived from classroom assignments. And Tak et al. (2014) rely on student self-reports of their most recent grades—a method the authors note may be subject to recall bias.
Other studies use standardized assessments. Hanson et al. (2012) use state tests; Stone (2009) uses the MAP test (a widely used interim assessment system). But treating teacher impressions, self-reported grades, and state assessments as equivalent indicators of achievement is cause for concern. Differences in scale, reliability, and meaning make it difficult to interpret the pooled estimate as capturing a single construct.
Differences in Teacher Training for Optimal Implementation
A third source of variation concerns the training teachers received before implementing a given curriculum. Some studies appear closer to efficacy trials. Jones et al. (2010), for example, describe 25 hours of teacher training and ongoing classroom coaching. Berger et al. (2018) report a four-day workshop (led by the study’s first author) and continuing supervision. Others describe much lighter preparation. Bakosh et al. (2016) provided 30 minutes of training on “program content, structure, and classroom tools, as well as related research on mindful awareness, cognition, and social emotional learning.” Hanson et al. (2012) offered one day of training and two hours of coaching.
Discussing the findings of their 2015 study, Robert Weiss and colleagues note that programs implemented under real-world conditions often receive far less support than those studied under ideal circumstances. Combining these two categories without distinction can lead to misleading conclusions about likely effects in typical classrooms.
Geographic Variation
Finally, the interventions were conducted in educational systems around the world. Twenty-five studies are from the United States, three are from Spain, two are from England, and there is one each from Australia, Canada, China, Finland, Ireland, Israel, Italy, the Netherlands, Norway, and Tanzania. In the introduction to each of these papers, the authors patiently explain why the features of that context make it necessary to develop new interventions or adapt existing ones to the unique circumstances of the country under study. So it feels odd to flatten those differences by pooling the resulting effect sizes in a single meta-analysis.
As Berry et al. (2016) note in reporting null findings from a randomized controlled trial in England, “[I]t cannot be assumed that an evidence-based programme will work in all contexts.”
Research Design
Looking beyond comparability, there are also questions about research design. The meta-analysis reports an average sample size of 843 students per study, but this figure is skewed by a single study with 5,791 observations. I reviewed the 40 studies to calculate the median instead, which is 337 students—a more informative statistic given the wide variation in sample sizes.
Some studies involve very small samples. Felver et al. (2019) include fewer students than a typical elementary school classroom. The largest study, with its 5,791 observations, is a dissertation. Because the meta-analysis weights studies by sample size, this large, unpublished dissertation wields disproportionate influence on the pooled effect, despite relying on a basic design comparing two treatment and two comparison schools and adjusting only for pre-existing differences in pre-test scores.
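To make both points concrete, here is a back-of-the-envelope calculation with invented sample sizes and effect estimates (only the 5,791-student outlier echoes a figure from the review, and a simple size-weighted average stands in for the inverse-variance weighting meta-analyses typically use). One very large study pulls the mean far above the median and dominates the pooled effect:

```python
# Hypothetical per-study sample sizes and effect estimates; only the
# 5,791-student outlier mirrors a figure mentioned in the review.
studies = [
    (150, 0.25), (220, 0.20), (300, 0.15), (337, 0.10),
    (410, 0.12), (560, 0.05), (5791, 0.40),
]

sizes = sorted(n for n, _ in studies)
mean_n = sum(sizes) / len(sizes)
median_n = sizes[len(sizes) // 2]

total_n = sum(n for n, _ in studies)
weighted = sum(n * es for n, es in studies) / total_n   # size-weighted pool
unweighted = sum(es for _, es in studies) / len(studies)

print(f"mean sample size: {mean_n:.0f}   median: {median_n}")
print(f"largest study's share of the weight: {max(sizes) / total_n:.0%}")
print(f"size-weighted pooled effect: {weighted:.2f}   unweighted: {unweighted:.2f}")
```

When a single study carries roughly three-quarters of the weight, the pooled number says more about that one dissertation than about the rest of the evidence base.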
Weiss et al. (2015), reflecting on their own quasi-experimental study, put the challenge succinctly: “We cannot infer a causal relationship. . . . highly motivated teachers may be more likely to volunteer. . . . the lack of random assignment cannot rule out the possibility that maturation might account for the academic gains seen.” Several studies included in this review rely on quasi-experimental designs with similar limitations.
Violation of a Core Assumption of Meta-Analysis
The treatment of the INSIGHTS intervention illustrates a more fundamental issue. Table 2 of the meta-analysis lists three separate studies by a research team led by McCormick. All three papers describe the same intervention and rely on student scores on the Woodcock-Johnson III Tests of Achievement. Yet the meta-analysis treats these three sets of results as if they provide three independent estimates of the program’s impact. To be clear, all three papers draw on the same test-score data collected from the same group of children; each simply examines the program from a different angle. One paper looks at processes inside classrooms to understand why the program might improve achievement. Another looks at differences across schools to see where the program seems to work best. A third focuses on how much families participated in the program. These are interesting questions, but they are not three independent estimates of whether the program works, and independence of the effect sizes being pooled is a core assumption of meta-analysis. By entering these “mediation,” “moderation,” and “dosage” findings as separate studies, the review effectively gives one intervention three times the weight of the others, amplifying its influence on the pooled estimate. The resulting summary is misleading.
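The arithmetic consequence is easy to see in a sketch with invented figures (these are not the actual INSIGHTS estimates or the review's weights):

```python
# Invented sample sizes and effect estimates; for illustration only.
other_studies = [(300, 0.05), (350, 0.10), (400, 0.00), (320, 0.08)]
insights = (435, 0.30)  # one study, properly entered as a single estimate

def pooled(studies):
    """Sample-size-weighted average effect across studies."""
    total_n = sum(n for n, _ in studies)
    return sum(n * es for n, es in studies) / total_n

counted_once = pooled(other_studies + [insights])
counted_three_times = pooled(other_studies + [insights] * 3)

print(f"pooled effect, study counted once:        {counted_once:.3f}")
print(f"pooled effect, study counted three times: {counted_three_times:.3f}")
```

Nothing about the underlying evidence changes between the two calculations; only the bookkeeping does, yet the pooled estimate shifts noticeably.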

A Path Forward for Researchers
None of these concerns imply that SEL lacks value, but they do suggest that the field would benefit from clearer standards:
- Prioritize randomized experiments whenever feasible. Otherwise, acknowledge when designs do not support causal claims.
- Use outcome measures capable of detecting meaningful differences, such as standardized tests, rather than coarse and subjective rating scales.
- Separate the roles of program developer and program evaluator and avoid situations where evaluators train teachers or supervise implementation. For example, the evaluation by Wright et al. (2010) has four authors, half of whom were involved in program delivery. Similarly, Berger et al. (2018) describe how the first author of the evaluation trained the teachers in a four-day workshop before the intervention began.
- In the same vein, separate program advocacy from evaluation. Evaluators whose professional role is to promote SEL may face conflicts of interest. One of the studies used in this meta-analysis (Fishbein et al., 2016) was co-authored by an employee of the Collaborative for Academic, Social and Emotional Learning (CASEL), which is a nonprofit organization built around the mission “to make SEL part of all students’ education.”
Strengthening research design will help ensure that claims about SEL’s impacts rest on the strongest possible evidence.
Conclusion
The idea that “cognition and emotion are inextricably tied together” (as NPR quoted one of the authors as saying) is neither new nor controversial. Few educators would disagree with that conclusion. But demonstrating a broad relationship between emotions and learning is not the same as providing rigorous evidence that any given SEL program will raise students’ test scores. This meta-analysis is an ambitious attempt to summarize a heterogeneous body of studies, but given the diversity of programs, measures, and contexts included—and the methodological issues noted above—the resulting estimates should not be treated as definitive. Continued investment in well-designed studies will help the field move toward clearer answers. In the meantime, policymakers should view broad claims about test-score gains from SEL programs with healthy skepticism.
Notes and Corrections
Note 1: The meta-analysis states that it includes 40 studies but lists only 39 in the references and in Table 2. The authors clarified in correspondence that one study—Tak et al. (2014)—was inadvertently omitted during proofing but was included in the analyses.
Tak, Y. R., Kleinjan, M., Lichtwarck-Aschoff, A., & Engels, R. C. M. E. (2014). Secondary outcomes of a school-based universal resiliency training for adolescents: A cluster randomized controlled trial. BMC Public Health, 14, 1171. https://doi.org/10.1186/1471-2458-14-1171
Note 2: The press release says the meta-analysis incorporates studies conducted across 12 countries, and the article states that the sample covers 15 countries. By my count, there are 13 countries represented in the studies they analyze: Australia, Canada, China, England, Finland, Ireland, Israel, Italy, Netherlands, Norway, Spain, Tanzania, and the United States.
Note 3: The Donato dissertation appears to have been completed in 2010, not 2009.

