Education Next, Winter 2014
Because the randomized controlled trial approach generates comparable treatment and control groups, we can use a straightforward set of analytic techniques, designed for use in social experiments, to estimate the impact of a school tour to an art museum on student outcomes. In its simplest form, this technique estimates mean differences using the following equation for outcome Y of student i in matched pair m:
(1) $Y_{im} = \alpha + \beta_1 \mathrm{Treat}_i + \beta_2 \mathrm{Match}_{im} + \varepsilon_{im}$
The binary variable $\mathrm{Treat}_i$ is equal to 1 if the student is in the treatment group that was randomly assigned to visit the museum for a school tour and is equal to 0 otherwise. Because the groups were created using a stratified randomization procedure within matched applicant group pairs, $\mathrm{Match}_{im}$ is also included in the model as a vector of dummy variables that have the statistical effect of estimating within, as opposed to across, matched pairs. Finally, $\varepsilon_{im}$ is a stochastic error term clustered at the applicant group level to take into account the spatial correlation from students nested within applicant groups.
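As a minimal sketch of what the matched-pair dummies in equation (1) do, the treatment coefficient can be computed by subtracting each pair's mean from both the treatment indicator and the outcome, then regressing one on the other (the Frisch-Waugh-Lovell result). The function name and the data below are hypothetical and chosen only for illustration. The sketch does not compute the clustered standard errors described above.

```python
from collections import defaultdict


def within_pair_effect(rows):
    """Estimate the coefficient on treat with one dummy per matched pair.

    rows: iterable of (pair_id, treat, outcome) tuples, with treat in {0, 1}.
    Demeaning within pairs gives the same coefficient as including the
    pair dummies explicitly in an OLS regression.
    """
    by_pair = defaultdict(list)
    for pair_id, treat, outcome in rows:
        by_pair[pair_id].append((treat, outcome))

    numerator = 0.0    # sum of demeaned treat times demeaned outcome
    denominator = 0.0  # sum of squared demeaned treat
    for members in by_pair.values():
        treat_mean = sum(t for t, _ in members) / len(members)
        outcome_mean = sum(y for _, y in members) / len(members)
        for t, y in members:
            numerator += (t - treat_mean) * (y - outcome_mean)
            denominator += (t - treat_mean) ** 2

    if denominator == 0:
        raise ValueError("no within-pair variation in treatment")
    return numerator / denominator


# Made-up data: two matched pairs with different baseline outcome levels.
students = [
    ("pair_a", 1, 12.0), ("pair_a", 1, 14.0),
    ("pair_a", 0, 10.0), ("pair_a", 0, 12.0),
    ("pair_b", 1, 23.0), ("pair_b", 0, 20.0), ("pair_b", 0, 22.0),
]
beta1 = within_pair_effect(students)
print(round(beta1, 3))
```

In this toy data the treated students score two points higher than the control students inside each pair, so the estimate reflects the within-pair gap and ignores the ten-point difference in baseline levels between the pairs.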
Proper randomization generates experimental groups that are comparable but not necessarily identical. The basic regression model can, therefore, be improved by adding controls for observable characteristics to increase the reliability of the estimated impact by accounting for minor differences and improving the precision of the overall statistical model. This yields the following equation to be estimated:
(2) $Y_{im} = \alpha + \beta_1 \mathrm{Treat}_i + \beta_2 \mathrm{Match}_{im} + \beta_3 \mathrm{Gender}_i + \beta_4 \mathrm{Grade}_i + \varepsilon_{im}$
where $\mathrm{Gender}_i$ is a dummy variable equal to 1 if the student is female and 0 otherwise, and $\mathrm{Grade}_i$ is a vector of dummy variables indicating the grade level of student i. In this model, $\beta_1$ is the parameter of interest and represents the effect of a school tour for students in the treatment group. Equation (2) is our preferred model for estimating overall impacts.
In addition, we are interested in the possibility of heterogeneous effects on particular subgroups of students. Subgroup effects are estimated by augmenting the basic analytic equation with indicator variables and an interaction term, where $S_i$ indicates that a student is a member of a particular subgroup:
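(3) $Y_{im} = \alpha + \beta_1 \mathrm{Treat}_i + \beta_2 \mathrm{Match}_{im} + \beta_3 \mathrm{Gender}_i + \beta_4 \mathrm{Grade}_i + \beta_5 S_i + \beta_6 (S_i \times \mathrm{Treat}_i) + \varepsilon_{im}$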
These models are used to estimate impacts on the separate components of the subgroups (e.g., impacts on minority and non-minority students separately) and to test for the difference in impacts between the two groups. In our analyses, we examine the subgroup effects for students in schools that have higher (> 50%) or lower (< 50%) proportions of students who are eligible for free or reduced-price lunch (FRL); students attending schools located in smaller towns (< 10,000 population) and larger towns (> 10,000 population); white and non-white students; and students making their first visits to the museum. When examining the impact of a first visit, we restrict our dataset to students in the treatment group who had visited the museum only once (i.e., on the school visit) and students in the control group who had never visited the museum. This excludes students who had been to Crystal Bridges outside of the school visit program prior to being surveyed.
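To make the subgroup estimates explicit, the following works through what equation (3) implies. It is a sketch that assumes the subgroup indicator S takes the value 0 or 1 and that the other covariates are held fixed; it is not taken from the original appendix.

```latex
\begin{align*}
E[Y_{im} \mid \mathrm{Treat}_i = 1,\, S_i = s] - E[Y_{im} \mid \mathrm{Treat}_i = 0,\, S_i = s]
  &= \beta_1 + \beta_6 s \\
\text{impact for students with } S_i = 0 &: \quad \beta_1 \\
\text{impact for students with } S_i = 1 &: \quad \beta_1 + \beta_6 \\
\text{difference in impacts between the two groups} &: \quad \beta_6
\end{align*}
```

Under these assumptions, testing whether the two subgroups respond differently to the school tour amounts to testing whether the coefficient on the interaction term is zero.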
Comparability of Treatment and Control Groups
Even within randomized controlled trials, treatment and control groups may differ significantly from each other by chance. To explore whether that occurred in our experiment, we compare the observed characteristics of treatment and control group students. We find no significant differences on observed characteristics.
Different outcomes in our study are based on different samples. The tolerance and historical empathy outcomes are based on items included in the survey administered to students during the spring of 2012. The critical thinking measure is based on an exercise given to students during the fall of 2012. The interest in art museums measure is based on items in surveys given to students during both semesters. The demographic characteristics for all three samples (Spring 2012, Fall 2012, and combined) are presented below in appendix tables 1, 2, and 3. None of the 27 differences between the observed characteristics of treatment and control group students presented in those tables is statistically significant at the conventional p < .05 level. The town population in the Spring 2012 sample differed at the p < .10 level, but with 27 comparisons, one such difference could easily occur by chance. We conducted joint F-tests for all three samples and found that, taken as a whole, the characteristics of our treatment and control groups did not differ significantly from each other.
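A rough back-of-the-envelope check of the "could occur by chance" point is sketched below. It assumes, purely for illustration, that the 27 comparisons are independent and that no true differences exist; neither assumption comes from the study, and the balance characteristics are likely correlated in practice.

```python
n_tests = 27  # balance comparisons reported across appendix tables 1-3
alpha = 0.10  # threshold at which one difference was flagged

# Expected number of comparisons flagged when no true difference exists.
expected_flags = n_tests * alpha

# Probability that at least one comparison is flagged, assuming independence.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected differences flagged by chance: {expected_flags:.1f}")
print(f"probability of at least one flag: {p_at_least_one:.3f}")
```

Under these simplifying assumptions, between two and three of the 27 comparisons would be expected to cross the p < .10 threshold by chance alone, and the probability of seeing at least one is about 0.94.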
We also administered a different survey to students in kindergarten through 2nd grade. We collected fewer descriptive characteristics about the K-2nd grade sample, but as shown in appendix table 4, we find no significant differences between the younger treatment and control group students either.
Appendix Table 1: Treatment/Control Balance of the Spring 2012 Sample, Grades 3-12
|Characteristic||(n = 1,899)||(n = 2,106)||Difference|
|School % FRL||50.44||52.73||-2.29|
|Miles from museum||35.23||37.10||-1.86|
* p < .10, two-tailed.
Appendix Table 2: Treatment/Control Balance of the Fall 2012 Sample, Grades 3-12
|Characteristic||(n = 1,860)||(n = 2,431)||Difference|
|School % FRL||58.10||58.56||-0.46|
|Miles from museum||34.90||43.64||-8.73|
Appendix Table 3: Treatment/Control Balance of the Combined Sample, Grades 3-12
|Characteristic||(n = 3,759)||(n = 4,537)||Difference|
|School % FRL||54.20||55.86||-1.66|
|Miles from museum||35.07||40.60||-5.53|
Appendix Table 4: Treatment/Control Balance of the Combined Sample, Grades K-2
|Characteristic||(n = 1,445)||(n = 1,189)||Difference|
|School % FRL||41.66||55.56||-13.90|
|Miles from museum||14.95||22.43||-7.48|
Critical Thinking Skills Inter-Coder Reliability
Our measure of critical thinking skills was developed and validated by Adams, Foutz, Luke, and Stein (2007) in their study of the School Partnership Program at the Isabella Stewart Gardner Museum in Boston. Students in 3rd through 12th grade during the fall of 2012 were shown a copy of Bo Bartlett’s painting, The Box. Students were asked to write a short essay in response to the questions: “What do you think is going on in this painting?” and “What do you see that makes you think that?” Each answer was scored blindly by one of two researchers, and both researchers independently scored an overlapping set of 750 responses.
The critical thinking measure is based on the number of instances that students engaged in the following in their essays: observing, interpreting, evaluating, associating, problem finding, comparing, and flexible thinking. Our measure of critical thinking is the sum of the counts of these seven items.
The 750 essays scored by both researchers allow us to calculate inter-coder reliability. Our researchers were highly consistent in their scoring of the composite critical thinking score as well as most of its seven components. As can be seen in appendix table 5, the Cronbach’s Alpha for the composite critical thinking score is .94. For the components, the Cronbach’s Alpha was between .73 and .85 for five of the seven items. Inter-coder reliability was weaker for problem finding and comparing, but students rarely displayed those components, so they make little difference to the composite score.
Appendix Table 5: Inter-Coder Reliability for Critical Thinking Items
|Item||Average (Std. Dev.)||Cronbach’s Alpha|
|Composite (Sum of 7)||8.16 (3.85)||0.94|
|Problem Finding||0.01 (0.12)||0.44|
|Flexible Thinking||0.17 (0.43)||0.82|
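For readers who want to see how a reliability figure of this kind can be computed, here is a minimal sketch of Cronbach's alpha that treats each coder's scores as one "item." The appendix does not spell out the exact computation used, so this setup is an assumption, and the scores below are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for k sets of scores on the same essays.

    ratings: list of k score lists, one per coder, in the same essay order.
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("need scores from at least two coders")
    n = len(ratings[0])
    if any(len(scores) != n for scores in ratings):
        raise ValueError("every coder must score the same essays")

    # Total score per essay, summed across coders.
    totals = [sum(essay_scores) for essay_scores in zip(*ratings)]
    # Sum of each coder's own score variance.
    coder_variance_sum = sum(variance(scores) for scores in ratings)

    return (k / (k - 1)) * (1 - coder_variance_sum / variance(totals))


# Invented composite scores from two coders for six essays.
coder_1 = [8, 5, 12, 9, 3, 7]
coder_2 = [9, 5, 11, 10, 4, 6]
alpha = cronbach_alpha([coder_1, coder_2])
print(round(alpha, 2))
```

Alpha approaches 1 when the coders rank and score essays almost identically, and it falls when a component is scored so rarely that small disagreements dominate its variance, which is consistent with the weaker values reported for the rarely displayed components.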
We used multiple survey items to measure tolerance, historical empathy, and the extent to which students were developing interest in art museums. Because we had theoretical reasons for expecting that these items measured the same underlying constructs, and for brevity of presentation, we standardized and combined the items into three scales, one for each construct. Cronbach’s Alpha tests show that the items reliably measure historical empathy and developing interest in art museums. The Cronbach’s Alpha for the tolerance scale, however, falls short of conventional standards for reliably measuring the same underlying construct. We nevertheless present the tolerance result as a combined scale for two reasons. First, presenting the four items in the scale separately would still show a positive relationship between school tours and tolerance and would just be less parsimonious. Second, the consistent empirical result and our theoretical expectation that these items measure the same construct outweigh our concerns about a weaker than normal Cronbach’s Alpha.
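The standardize-and-combine step can be sketched as follows. The appendix does not say whether the standardized items were averaged or summed, so averaging is an assumption here, and the item names and responses are hypothetical.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize one item to mean 0 and standard deviation 1."""
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        raise ValueError("item has no variation and cannot be standardized")
    return [(v - center) / spread for v in values]


def build_scale(items):
    """Average the standardized items into one scale score per student.

    items: dict mapping an item name to a list of responses, with every
    list in the same student order.
    """
    standardized = [zscores(responses) for responses in items.values()]
    return [mean(student_scores) for student_scores in zip(*standardized)]


# Hypothetical responses from five students to three survey items.
responses = {
    "item_1": [1, 2, 3, 4, 5],
    "item_2": [2, 2, 4, 4, 5],
    "item_3": [1, 3, 3, 5, 5],
}
scale = build_scale(responses)
print([round(score, 2) for score in scale])
```

Standardizing first puts items with different response ranges on a common footing, so no single item dominates the combined scale.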
Appendix Table 6: Cronbach’s Alpha for Outcome Scales
|Scale||Number of Items||Cronbach’s Alpha|