The study you're referring to above is this one:
Motivated Numeracy And Enlightened Self-Government by Dan M. Kahan, Ellen Peters, Erica Cantrell Dawson and Paul Slovic, Yale Law School Public Working Paper No. 307, Cultural Cognition Project Working Paper No. 116, September 3rd, 2013. [Full paper downloadable from here]
Kahan et al, 2013 wrote:
Abstract:
Why does public conflict over societal risks persist in the face of compelling and widely accessible scientific evidence? We conducted an experiment to probe two alternative answers: the “Science Comprehension Thesis” (SCT), which identifies defects in the public’s knowledge and reasoning capacities as the source of such controversies; and the “Identity-protective Cognition Thesis” (ICT) which treats cultural conflict as disabling the faculties that members of the public use to make sense of decision-relevant science. In our experiment, we presented subjects with a difficult problem that turned on their ability to draw valid causal inferences from empirical data. As expected, subjects highest in Numeracy — a measure of the ability and disposition to make use of quantitative information — did substantially better than less numerate ones when the data were presented as results from a study of a new skin-rash treatment. Also as expected, subjects’ responses became politically polarized — and even less accurate — when the same data were presented as results from the study of a gun-control ban. But contrary to the prediction of SCT, such polarization did not abate among subjects highest in Numeracy; instead, it increased. This outcome supported ICT, which predicted that more Numerate subjects would use their quantitative-reasoning capacity selectively to conform their interpretation of the data to the result most consistent with their political outlooks. We discuss the theoretical and practical significance of these findings.
The authors performed an interesting test in this paper. Having gathered a sample of people, they gave them a 2×2 contingency-table problem. However, one randomly selected group received the problem with the columns labelled one way, while the remainder received it with the columns switched, meaning the two groups should have reached opposite conclusions if they performed the computations correctly. The authors admitted that this was a hard problem to give to the test groups, because even mathematically astute people can fail to assess 2×2 contingency tables correctly. As the authors state in the paper:
Kahan et al, 2013 wrote:
Correctly interpreting the data was expected to be difficult. Doing so requires assessing not just the absolute number of subjects who experienced positive outcomes (“rash better”) and negative ones (“rash worse”) in either or both conditions but rather comparing the ratio of those who experienced a positive outcome to those who experienced a negative one in each condition. Comparing these ratios is essential to detecting covariance between the treatment and the two outcomes, a necessary element of causal inference that confounds even many intelligent people (Stanovich 2009; Stanovich & West 1998).
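To make that concrete, here's a minimal Python sketch of the ratio comparison the authors describe. The cell counts below are hypothetical, chosen (as in the study's stimulus) so that the within-condition ratios point the opposite way from the eye-catching raw counts:

```python
# Hypothetical 2x2 contingency table (illustrative numbers, not taken
# from the paper):
#
#                        rash better   rash worse
#   used skin cream           223           75
#   did not use cream         107           21

treated_better, treated_worse = 223, 75
control_better, control_worse = 107, 21

# The valid inference compares the ratio of positive to negative
# outcomes *within* each condition, not the raw cell counts.
treated_ratio = treated_better / treated_worse   # about 2.97
control_ratio = control_better / control_worse   # about 5.10

# The treatment group improved at a lower rate than the control group,
# so these data support "skin cream didn't work" despite the large
# count of 223 in the treatment/rash-better cell.
print(treated_ratio < control_ratio)  # prints: True
```

Note how the biggest single number in the table sits in the cell that the wrong answer seizes on; that is exactly the trap the stimulus was built around.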
This is where life got interesting. The authors further split the experiment by subdividing the groups. One subgroup was given the data labelled as coming from a politically neutral topic (a medical test of a topical skin cream), whilst the other subgroup was given the same data, but this time labelled as the outcome of a gun control policy.
So the experiment was a four-way test, with all four groups given essentially the same data, but with different labelling:
Group A: medical test, labelled to point to the conclusion "skin cream worked"
Group B: medical test, labelled to point to the conclusion "skin cream didn't work"
Group C: gun control policy, labelled to point to the conclusion "gun control worked"
Group D: gun control policy, labelled to point to the conclusion "gun control didn't work"
From the paper again:
Kahan et al, 2013 wrote:
Based on previous studies using the design reflected in this experiment, it is known that most people use one of two heuristic alternatives to this approach. The first involves comparing the number of outcomes in the upper left cell to the number in the upper right one (“A vs. B”). The other (“A vs. C”) involves comparing the numbers in the upper left and lower left cells (Wasserman, Dorner & Kao 1990).
Each of these heuristic strategies generates a recognizable species of invalid causal inference. “A vs. B” amounts to assessing a treatment without considering information from a control. “A vs. C” compares outcomes in the treatment and control but in a manner that neglects to consult information necessary to disentangle the impact of the treatment from other influences at work in both conditions.
In the real world, of course, use of either of these defective strategies—both of which amount to failing to use all the information necessary to make a valid causal inference—might still generate the correct answer. But for our study stimulus, the numbers for the cells of the contingency table were deliberately selected so that use of either heuristic strategy would generate an incorrect interpretation of the results of the fictional skin-treatment experiment.
The second two versions of the experiment involved a gun-control measure (Figure 3). Subjects were instructed that a “city government was trying to decide whether to pass a law banning private citizens from carrying concealed handguns in public.” Government officials, subjects were told, were “unsure whether the law will be more likely to decrease crime by reducing the number of people carrying weapons or increase crime by making it harder for law-abiding citizens to defend themselves from violent criminals.” To address this question, researchers had divided cities into two groups: one consisting of cities that had recently enacted bans on concealed weapons and another that had no such bans. They then observed the number of cities that experienced “decreases in crime” and those that experienced “increases in crime” in the next year. Supplied that information once more in a 2x2 contingency table, subjects were instructed to indicate whether “cities that enacted a ban on carrying concealed handguns were more likely to have a decrease in crime” or instead “more likely to have an increase in crime than cities without bans.” The column headings on the 2x2 table were again manipulated, generating one version in which the data, properly interpreted, supported the conclusion that cities banning guns were more likely to experience increased crime relative to those that had not, and another version in which cities banning guns were more likely to experience decreased crime.
Overall, then, there were four experimental conditions—ones reflecting opposite experiment results for both the skin-treatment version of the problem and the gun-ban version. The design was a between-subjects one, in which individuals were assigned to only one of these conditions. For sake of expository convenience, we will refer to the conditions as “rash decrease,” “rash increase,” “crime decrease,” and “crime increase,” with the label describing the result that a correct interpretation of the 2x2 contingency table would most support.
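The two invalid heuristics quoted above, and the valid ratio comparison, can be sketched side by side. The cell values here are hypothetical, but chosen the way the authors describe: so that both shortcut strategies yield the wrong conclusion.

```python
def valid_inference(a, b, c, d):
    """Treatment helped iff the within-condition ratio a/b exceeds c/d.

    Cells: a = treatment/positive, b = treatment/negative,
           c = control/positive,   d = control/negative.
    Cross-multiplying (a*d > c*b) avoids dividing by zero.
    """
    return a * d > c * b

def heuristic_a_vs_b(a, b, c, d):
    """Invalid shortcut: judge the treatment by its own outcomes only,
    without consulting the control condition."""
    return a > b

def heuristic_a_vs_c(a, b, c, d):
    """Invalid shortcut: compare positive outcomes across conditions,
    ignoring how many subjects were in each condition."""
    return a > c

# Hypothetical cells, deliberately chosen so the shortcuts mislead:
a, b, c, d = 223, 75, 107, 21

print(heuristic_a_vs_b(a, b, c, d))  # True  -> "it worked" (wrong)
print(heuristic_a_vs_c(a, b, c, d))  # True  -> "it worked" (wrong)
print(valid_inference(a, b, c, d))   # False -> it did not work
```

Switching the column labels flips which answer is correct without changing any of the arithmetic, which is what lets the design separate computation skill from motivated interpretation.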
What happened when the authors conducted this experiment?
Kahan et al, 2013 wrote:
3.5. Hypotheses
We formed three hypotheses. The first was that subjects high in numeracy would be more likely to get the right result in both skin-treatment conditions.
This hypothesis reflected results in previous studies. As indicated, such studies show that the covariance-detection problem featured in this experiment is very difficult for most people to answer correctly (Stanovich 2009).
One recent study, however, shows that the likelihood of answering the problem correctly is pre-dicted by an individual’s score on the Cognitive Reflection Test (Toplak, West & Stanovich 2011). The CRT features a set of problems, each of which is designed to prompt an immediate and intuitively compelling response that is in fact incorrect. Because supplying the correct answer requires consciously stifling this intuition and logically deducing the right response, the CRT is understood to measure the disposition to use the slower, deliberate form of information-processing associated with System 2, as opposed to the rapid, heuristic-driven form associated with System 1.
The CRT requires elementary mathematical skills, but is not a numeracy test per se (Liberali, Reyna, Furlan, Stein & Pardo 2012). However, insofar as making valid causal inferences in the covariance-detection problem likewise demands suppressing the heuristic tendency to give decisive significance to suggestive but incomplete portions of the information reflected in the 2x2 contingency table, it is not surprising that individuals who score higher on CRT are more likely to correctly interpret the data the table contains.
We would expect Numeracy scale to be an even stronger predictor of how likely a person is to select the correct response in the skin-treatment versions of this problem. Like the CRT, Numeracy measures a disposition to subject intuition to critical interrogation in light of all available information—and thus to avoid mistakes characteristic of over-reliance on heuristic, System 1 information processing (Liberali et al. 2012). Indeed, two CRT items are conventionally included in the Numeracy scale (Weller, Dieckmann, Tusler, Mertz, Burns & Peters 2012), and we added the third in this study in order to reinforce its sensitivity to the disposition to preempt reliance on unverified intuition. However, whereas the CRT measures the disposition to use System 2 information processing generally, Numeracy measures how proficient individuals are in using it to reason with quantitative information in particular, a capacity specifically relevant to the covariance-detection problem featured in the stimulus.
The hypothesis that performance in the skin-treatment conditions would be positively correlated with Numeracy was common to SCT and ICT. The second and third hypotheses reflect opposing SCT and ICT predictions relating to results in the gun-ban conditions.
Whereas the issue in the skin-treatment versions of the covariance-detection problem—does a new skin cream improve or aggravate a commonplace and nonserious medical condition—is devoid of partisan significance, the question whether a gun ban increases or instead decreases crime is a high profile political one that provokes intense debate. Consistent with the growing literature on culturally or ideologically motivated reasoning (Jost, Hennes & Lavine in press), we anticipated that subjects in the gun-ban conditions would be more likely to construe the data as consistent with the position that prevails among persons who share their political outlooks—regardless of which version of the problem (“crime increases” or “crime decreases”) they were assigned. Specifically, we surmised that gratification of the interest subjects would have in confirmation of their predispositions would reinforce their tendency to engage in heuristic reasoning when subjects were assigned to the condition in which “A vs. B” or “A vs. C” generated a mistaken answer that was nonetheless congenial to their political outlooks. That ideologically motivated reasoning would compound heuristic reasoning in this way was specifically supported by studies showing that an existing position on a contested nonpolitical issue (Dawson & Gilovich 2000), aversion to threatening information (Dawson, Gilovich & Regan 2002), and prior beliefs (Stanovich & West 1998) can all magnify the sorts of reasoning errors frequently encountered in covariance-detection problems identical or closely related to the one featured in our stimulus.
I'll let you read the rest of the paper, and see the huge difference in the graphs toward the end. It's both hilarious and disturbing to observe.