Post in a nutshell: This blog post is about Bayes factors, and a snippet of R code I put together to calculate Bayes factors for t tests one may encounter in the literature. I wanted to come up with a principled way to evaluate how well data support a researcher's tested hypothesis, relative to the null.

Update: To make things even easier, I put together an app that does everything described in this post for you. It's RIGHT HERE!

OK, go time.

Over the past several months, I've been playing around with some Bayesian stats stuff. There are tons of great resources out there outlining how Bayesian analysis works, and there are definitely different schools of thought.

One of the most interesting approaches concerns calculating Bayes factors. Probably the most accessible intro I've seen comes from Alex Etz and his terrific blog series about Understanding Bayes. Check it out, you won't be disappointed!

Without wading into the details, Bayes factors are basically a way to see how well data support different hypotheses (again, check out Alex's blog and dig around a bit for details...this post isn't technical at all). Typically, we want to compare how well the data fit some specified alternative hypothesis, relative to the null hypothesis of "no effect." Sounds straightforward. But the devil's in the details. Really, the only tricky part is in specifying that pesky alternative hypothesis.

Some great stats software out there like JASP (download this, and use it!) and the BayesFactor package for R provide easy-to-use tools for calculating Bayes factors. JASP lets you do this with SPSS-style dropdown menus. BayesFactor requires some super-light R code.

In a nutshell, these approaches pit the null against an alternative hypothesis. Rather than a super-specific point alternative (such as effect size d = .8 EXACTLY), Bayes factors let you pit your null against an alternative hypothesis that spreads its bets out across a bunch of possible alternative effect sizes.

*Clarification: you could do a BF for a point alternative. But the prediction is rarely that specific.

So, for instance, you might have an alternative hypothesis that manipulation X should cause responses on measure Y to increase by a fair bit. This alternative is directional ("increase"), but kind of vague in terms of just how big an effect to expect. To illustrate it with a crappy Paint sketch, we might picture the null hypothesis as the red line, and the alternative as the blue shaded region:

So, your alternative would be that small effects are still more probable than big effects, but on the whole it's centered on a pretty big effect (say at d = .7). Alternatively, you might have a directional prediction, but expect much more modest effect sizes, like in this equally crappy sketch:

It's still directional, but now most of the "prediction" is falling on small effect sizes, and the "prediction" is centered around d = .2 or so.

The exact same data will return different Bayes factors for these two alternative hypotheses, relative to the same null. If the data actually show a big effect, then the first alternative will be a better fit (higher BF10). But if the effect is tiny, the data might actually support the null, relative to that first alternative. If your guess is "pretty big effect" and the data show a pretty tiny effect, the Bayes factor may tell you that the null is more likely than THAT SPECIFIC alternative hypothesis of a big effect. Those same data (small effect), on the other hand, might support the second alternative relative to the null.

To recap: the same data will give different Bayes factors for different alternatives.
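To make that concrete, here is a minimal base-R sketch. It is not the function from this post. It assumes two equal-sized groups and a directional half-Cauchy prior on d, and it gets the Bayes factor by numerically averaging the noncentral-t likelihood of the observed t over that prior. The helper name `bf10_half_cauchy` and the t values are made up for illustration.

```r
# Hypothetical helper: directional BF10 for a two-sample t statistic,
# with a half-Cauchy(0, r) prior on the effect size d under the alternative.
bf10_half_cauchy <- function(t, n_per_group, r) {
  df    <- 2 * n_per_group - 2
  n_eff <- n_per_group / 2              # n1 * n2 / (n1 + n2) for equal groups
  # Likelihood of t under the alternative: average the noncentral-t
  # density over positive effect sizes, weighted by the prior.
  integrand <- function(d) {
    dt(t, df, ncp = d * sqrt(n_eff)) * 2 * dcauchy(d, scale = r)
  }
  m1 <- integrate(integrand, lower = 0, upper = Inf)$value
  m0 <- dt(t, df)                       # likelihood of t under the null (d = 0)
  m1 / m0
}

# Made-up data: 200 per group, one small observed effect and one big one.
small_effect <- c(wide   = bf10_half_cauchy(1.5, 200, r = 0.7),
                  narrow = bf10_half_cauchy(1.5, 200, r = 0.2))
big_effect   <- c(wide   = bf10_half_cauchy(5, 200, r = 0.7),
                  narrow = bf10_half_cauchy(5, 200, r = 0.2))
print(small_effect)
print(big_effect)
```

With the small observed effect, the narrow alternative should come out ahead of the wide one; with the big observed effect, the ordering should flip. Same data, different alternatives, different Bayes factors.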

Now, JASP and BayesFactor are great tools. Some feel, though, that the default test is biased against small effects. That isn't necessarily a problem with the Bayes factor approach. Rather, it's an indictment of the default alternative hypothesis built into JASP or BayesFactor, and both of these tools let you tinker with that alternative.

Here's a screen grab of JASP:

If you aren't a fan of the presets, it's easy to change things. If you want the alternative hypothesis to be directional (as in my crappy sketches), then you check a box for "Group one > Group two" or "Group one < Group two." And you can change the effect size specification with the "Cauchy prior width" box. As a default, it's set at .707. For a directional alternative, that just means the middle (median) of the alternative hypothesis effect size distribution will be at d = .707; for a two-sided alternative, half of the prior sits between d = -.707 and d = .707. Set it to .231, and the directional median moves to d = .231. My sketches are pretty bad, but I'd say that's roughly what the two of them represent: prior widths of ~.7 and ~.25.
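If you want to see what a given prior width implies, base R's Cauchy functions are enough. This is just a quick check, assuming the prior is a Cauchy centered on zero with scale equal to the "prior width"; the variable names are mine.

```r
r <- 0.707  # default prior width

# Two-sided alternative: share of the prior between d = -r and d = r
mass_within_r <- pcauchy(r, scale = r) - pcauchy(-r, scale = r)

# Directional alternative: folding the prior onto positive effects puts
# its median at the 75th percentile of the symmetric Cauchy, which is r.
half_cauchy_median <- qcauchy(0.75, scale = r)

# Share of a directional prior sitting on effects bigger than d = .7,
# for the default width and for a much narrower one
p_big_default <- 2 * pcauchy(0.7, scale = 0.707, lower.tail = FALSE)
p_big_narrow  <- 2 * pcauchy(0.7, scale = 0.231, lower.tail = FALSE)

print(c(mass_within_r = mass_within_r, half_cauchy_median = half_cauchy_median))
print(c(p_big_default = p_big_default, p_big_narrow = p_big_narrow))
```

The narrower width bets far less of its prior on big effects, which is the whole point of changing it.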

Why might the default test seem prejudiced against small effects? Well, it isn't really looking for them. Within social psychology, for example, the median effect size is something like .36. And only around 20% of effects are larger than d = .7. So the JASP and BayesFactor default setting just might not be a good fit for social psychology. But it's super-easy to change. For my money, .36 is probably a better default than .707 for social psychology, because that's where our effects are actually centered.

## Evaluating t tests out there

I spend a lot of time reading the psych literature, and I thought it'd be fun to tinker with Bayes factors for evaluating evidence in the stuff I'm reading. Often, I'm interested in questions like...

Do these data support the author's hypothesis more than the null?

Bayes factors are a natural fit, and the BayesFactor package has some options for inputting test statistics when you don't have raw data. So I really wanted to calculate Bayes factors for some papers out there. But here's the catch: What settings to use for the alternative hypothesis?

I wanted my exploration to be fairly generous. It would be pretty easy to calculate something like a Bayes factor with a ludicrous alternative hypothesis and then say "see, these data actually support the null!" But it's not very useful to support the null, relative to a *hypothesis that nobody had in the first place*. Ideally, I'd want to do one of two things instead:

- See how well the data fit an alternative I'm interested in (informed by literature, etc.), vis a vis the null.
- See how well the data fit the hypothesis the author actually tested, vis a vis the null.

The first one is pretty easy: I come up with a plausible alternative (check the literature, find out what to expect, etc), set the prior scale width to that, and fire away.

The second one is trickier, as it's entirely possible that the authors didn't cleanly specify their alternative hypothesis of interest.

To tackle the second one, I thought I'd let the researcher's methodological choices tell me what their alternative hypothesis was. At the very least, it tells me what hypothesis their study was well-designed to test. What size effect did they expect? Well, we can work backwards to this by looking at sample size. If someone runs 20 participants per condition, he or she is (whether consciously or not) designing a study that is adequately (80%) powered to detect only big effects (d ~ .9). On the other hand, if someone runs 200 per condition, he or she is capable of detecting smaller effects (d ~ .28). So, to infer what size effect a researcher is looking for (i.e., their alternative hypothesis), you can just figure out how big of an effect their study was well-designed to detect. Whether explicitly stated or not, in a sense, this *was* the alternative hypothesis the authors tested.
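Working backwards from sample size to effect size can be done with base R's `power.t.test()`, which solves for whichever argument you leave out. Here's a minimal sketch, assuming a two-sample design and alpha = .05; the wrapper name `detectable_d` is just something I made up for this example.

```r
# Effect size (Cohen's d) that a two-group study with n_per_group
# participants per condition can detect with the requested power.
detectable_d <- function(n_per_group, power = 0.8, one_tailed = FALSE) {
  power.t.test(
    n = n_per_group,
    power = power,
    sd = 1,                 # sd = 1, so delta is in Cohen's d units
    sig.level = 0.05,
    type = "two.sample",
    alternative = if (one_tailed) "one.sided" else "two.sided"
  )$delta                   # delta is left NULL above, so it gets solved for
}

d_20  <- detectable_d(20)    # 20 per condition: should land near .9
d_200 <- detectable_d(200)   # 200 per condition: should land near .28
print(round(c(n20 = d_20, n200 = d_200), 2))
```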

## The R function

I played around for a while with some options and eventually wrote up an R function for this. The code can be downloaded right HERE. The guts of the actual function you'd need to use end up being a command like this, after you load the packages and function in question:

t.bf(df, t, pow, onetail, Ha)

where you have to input stuff for df, t, and the rest of the gang. df and t are easy: they are just the stats that you pulled from the paper in question. All of the other inputs are optional, with explanations below. So say you ran across the following: "As hypothesized, jumping on one foot significantly increased temporal discounting, t(38) = 2.04." Really, all you need to run is this:

t.bf(df = 38, t = 2.04)

This will spit out a Bayes factor with the following features:

- It is directional, assuming that the authors hypothesized the right direction. I told you before, I want to be generous.
- It assumes the df comes from equal sized groups. Probably a dodgy assumption, but if all you have is df and t values, it's a good place to start.
- For figuring out the anticipated effect size, it assumes they were trying for power = .8. If you want to change this, add an alternative power level (e.g., t.bf(df = 38, t = 2.04, pow = .5) for 50% power).
- It assumes that the researchers were using two-tailed tests with their t-test. If they were doing a one-tailed test, you modify the code to something like t.bf(df = 38, t = 2.04, onetail = T).
- As a default, it sets the prior scale width to the effect size gleaned from the power analysis. But you can change this by tweaking the "Ha" part of the code. It can also implement "small" (d = .2), "medium" (d = .5), "large" (d = .8), or "social" (d = .36) alternatives. If you want the alternative to be small, the code would be t.bf(df = 38, t = 2.04, Ha = "small"). If you want a "social" alternative, it becomes t.bf(df = 38, t = 2.04, Ha = "social").

So aside from optional specifications for the researchers' intended power, use of one-tailed tests, and different alternatives, you just need the df and t value. I was shooting for user-friendliness, but with a few bare-bones options for customization. Not perfect, but hopefully of some use.
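For anyone curious how those pieces could fit together, here is a rough base-R sketch. To be clear, this is NOT the downloadable t.bf() function: the name `t_bf_sketch`, its internals, and its output format are my assumptions based on the description above, and its output should not be expected to reproduce the numbers reported in this post. It swaps in direct numerical integration for the BayesFactor package. If you use BayesFactor's `ttest.tstat()` instead, keep in mind that it returns the natural log of the Bayes factor unless you set `simple = TRUE`.

```r
# Hypothetical sketch only: names, defaults, and internals are assumptions.
t_bf_sketch <- function(df, t, pow = 0.8, onetail = FALSE, Ha = NULL) {
  n     <- (df + 2) / 2   # assume the df comes from two equal-sized groups
  n_eff <- n / 2          # n1 * n2 / (n1 + n2)
  presets <- c(small = 0.2, social = 0.36, medium = 0.5, large = 0.8)

  if (is.null(Ha)) {
    # Default: prior width = effect size the study had `pow` power to detect
    r <- power.t.test(
      n = n, power = pow,
      alternative = if (onetail) "one.sided" else "two.sided"
    )$delta
  } else if (is.character(Ha)) {
    r <- presets[[Ha]]    # named preset, e.g. "small" or "social"
  } else {
    r <- Ha               # a numeric prior width supplied directly
  }

  # Directional Bayes factor: noncentral-t likelihood of the observed t,
  # averaged over a half-Cauchy(0, r) prior, relative to the null.
  integrand <- function(d) {
    dt(t, df, ncp = d * sqrt(n_eff)) * 2 * dcauchy(d, scale = r)
  }
  bf10 <- integrate(integrand, lower = 0, upper = Inf)$value / dt(t, df)
  c(prior_width = r, BF10 = bf10)
}

default_fit <- t_bf_sketch(df = 38, t = 2.04)
social_fit  <- t_bf_sketch(df = 38, t = 2.04, Ha = "social")
onetail_fit <- t_bf_sketch(df = 38, t = 2.04, onetail = TRUE)
print(rbind(default_fit, social_fit, onetail_fit))
```

Returning the prior width alongside the Bayes factor is a design choice of this sketch: it makes it easy to see which alternative the data were actually compared against.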

## Some examples

Let's put it to the test with some actual examples!

### 1. Physical and interpersonal warmth

In 2008, Williams and Bargh published a now-controversial set of results seeming to show an intimate connection between physical and interpersonal warmth. In one study, they reported that people holding warm coffee rated a target as more interpersonally warm than did people holding iced coffee, *F*(1, 39) = 4.08, *P* = 0.05. We can translate that into t-test lingo (with one numerator degree of freedom, t is just the square root of F), and use the following command to see how well the data support the hypothesis they tested, relative to the null:

t.bf(39, sqrt(4.08))

The Bayes factor is .97. Basically, this study is almost entirely inconclusive. The data support the null and the tested alternative almost equally, with an ever-so-slight preference for the null.

### 2. High intensity religious rituals

In 2013, Dimitris Xygalatas and friends published a neat study looking at how high intensity religious rituals affect prosocial behavior. People observing or taking part in intense rituals like the one in this picture gave more money to local temples, t(84) = 4.61.

t.bf(84, 4.61)

The data support the tested alternative, relative to the null, BF10 = 7.79. Neato!

### 3. God bailing you out for nonmoral risk taking

Continuing the religion theme, in 2015, Kupor, Laurin, and Levav published a bunch of studies in which they reported that getting people to think about God made people more willing to take risks (skydiving, etc), with the logic that people might think God would bail them out if things got rough. In the first study (1a), participants played a word game in which some of them saw religious words. According to the authors, the religious primes made MTurkers report more willingness to engage in some risky behaviors, *t*(59) = 2.21, *p* = .031.

t.bf(59, 2.21)

The data ever so slightly favor the tested alternative over the null, BF10 = 1.33.

### 4. Analytic thinking among creationists and evolutionists

I recently reported some correlations between a tendency to engage in analytic thinking and people's endorsement of evolution vs. creationism. This proved controversial on the blogosphere (Link, Link). I have no interest in revisiting that particular quagmire. But this is also a cool illustration of the importance of considering reasonable alternative hypotheses that are (ideally) informed by the initial author's own choices.

One blog post proposed some likelihood tests (note: not exactly the Bayes factor stuff of this post) to explore whether creationists and evolutionists score differently on the CRT, an analytic thinking task. This post focused on a likelihood test comparing the null (no difference) versus the alternative hypothesis that evolution believers score .8 points higher on the CRT than creationists. This test favored the null something like 2000-to-1.

The catch is, .8 is a ginormous difference on the CRT. It's roughly equivalent to the difference on this analytic thinking task between random strangers on the street and students at Harvard or Princeton. And it's certainly a much bigger difference than I'd ever expect among students at the same university, grouped by something they happen to say they believe in or not. So the data are more consistent with "no difference" than they are with "implausibly huge difference" by a factor of 2000. But that isn't very useful information at all. More importantly, it doesn't approach the question I want answered in this blog post: How well do the data support the null, relative to *an informed* alternative derived from the researcher's actual study?

Going back to the raw data, in each of two studies, there were differences in analytic thinking between creationists and evolutionists, Study 1: t(475.33) = 3.26, p = .001; Study 2: t(443.78) = 3.36, p = .0008. How well do the data support the *alternative hypothesis actually tested*, relative to the null? Let's run the code:

t.bf(475.33, 3.26)

t.bf(443.78, 3.36)

In this case the Bayes factors both support the alternative, relative to the null, BF10 = 3.90 and 4.19. Nothing knock-down conclusive. But the data are roughly 4x more likely under the tested alternative than under the null.

Just to illustrate the code some more, let's say I wanted to see the BF for a "small" alternative (prior width = .2) and a "social" alternative (prior width = .36). What would that code look like?

t.bf(475.33, 3.26, Ha = "small")

t.bf(475.33, 3.26, Ha = "social")

This would give BF10 = 3.88 and 3.86, respectively. Not much different from inferring the effect size from the sample sizes I ran (no coincidence here: I ran big studies because I anticipated a small effect, consistent with available literature).

This one really illustrates the importance of taking the initial author's research design into consideration when looking into likelihoods. The data strongly prefer the null to an alternative hypothesis that to my knowledge nobody has ever had (difference of .8 on the CRT). But they do support informed and plausible alternative hypotheses that were actually tested, vis a vis the null. Interesting.

### 5. God primes and prosocial behavior

Why not complete the set with the religion theme? Back in 2007, Shariff and Norenzayan reported that religious priming (unscrambling some sentences with religious words) made people more generous in a Dictator Game, Study 1: t(48) = 3.69, p <.001.

t.bf(48, 3.69)

In this case, the data support the tested alternative relative to the null, BF10 = 4.62. Neato.

### 6. Feeling the Future

What the heck, why not look to Bem's "feeling the future" paper? This can highlight some of the flexibility of the R function, as Bem used one-tailed tests throughout. So when figuring out what alternative hypothesis his studies were designed to test, it's worth looking at what effect size Bem's studies were well-powered to detect in a one-tailed test. From Study 1, people were better able to "see" future erotic images than future nonerotic images, t(99) = 1.85, p = .031.

t.bf(99, 1.85, onetail=T)

Note here that the "onetail" argument isn't asking whether the Bayes factor is directional...I'm a generous guy, so the Bayes factor part of the code is always directional, assuming that the authors picked the right effect direction. Instead, it changes how the expected effect size gets figured out: here, it's based on the one-tailed t-test Bem reported. The data slightly support the null relative to the tested (one-tailed) alternative, BF10 = .77.

## Coda

There you have it. I wanted to use Bayes factors to evaluate evidence in the published literature. Specifically, I wanted a tool to see how well the data support *the alternative hypothesis the authors actually designed a study to test*, relative to the null. So this function guesses what the alternative hypothesis was based on the choices the authors actually made. Maybe that's a good idea. Maybe not. But in the absence of actually knowing what an author's alternative hypothesis *was*, we can make some guesses based on the alternative hypothesis that the study was *well-designed to test*. And, hey, there's also some flexibility built in for varying power and the magnitude of expected effects (small, social, large, etc).

Now give it a try for yourself with the app.

Enjoy!

Caveats: Yeah, I probably messed something up in there. Apologies in advance. Also, I think the code is reasonably annotated. But again, maybe not. Drop a comment if you see room for improvement. And maybe the whole idea was nonsense from the get-go.