Let's say you're a scientist. You come up with a prediction. Let's say you have solid theoretical reasons for thinking that if you give children caffeine, they will show higher levels of energy. You decide to run an experiment to test this prediction. So you go to wherever kids hang out, and you grab a bunch of them. You randomly split them into two groups, a control group and an experimental group. You give each kid in your control group some decaffeinated coffee. You give each kid in your experimental group a couple shots of espresso. And then you measure their energy levels somehow (say, time how long they can sit still). Great, now you've got some data. Time to analyze it. (Incidentally, you should probably also return the kids to wherever you found them...and lay low for a few days).
In standard null hypothesis significance testing (NHST) approaches, you'll basically see how far apart the groups are on your outcome measure (the difference in means between groups), relative to how much spread there is within the groups. Then you do a test to see if the difference between groups is greater than you'd expect by chance alone. So you check to see if your results are weirder than 95% of the samples that would be drawn from a population where there's no effect.
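To make "difference between groups relative to spread within groups" concrete, here is a minimal sketch of a pooled two-sample t statistic. The function name pooled_t and the energy scores are made up purely for illustration (they are not data from any real study), and the sketch assumes both groups share a common variance.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(control, treatment):
    """Two-sample t statistic: difference in means divided by its standard error."""
    n_c, n_t = len(control), len(treatment)
    # Pooled within-group variance (assumes equal variances in both groups).
    pooled_var = (
        (n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)
    ) / (n_c + n_t - 2)
    std_err = sqrt(pooled_var * (1 / n_c + 1 / n_t))
    return (mean(treatment) - mean(control)) / std_err


# Hypothetical energy scores (invented for this example; higher = more energetic).
decaf = [4, 5, 6, 5, 5]
espresso = [6, 7, 8, 7, 7]

t_stat = pooled_t(decaf, espresso)
print(f"t = {t_stat:.2f}")
```

A bigger gap between the group means pushes the statistic up, and more spread within the groups pulls it back down. The sign tells you the direction of the difference, which matters once you start thinking about one versus two tails.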
When you're doing your inferential test, you--by established convention--run a two tailed test. Essentially, this means that you predicted "I think caffeine will make kids more energetic," but when you run your stats you're actually testing something more like "Did caffeine change kids' energy levels in either direction?" That's right, you ask your stats software: "Does it look like caffeine either increased or decreased kids' energy?" That isn't what you actually predicted! You are actually testing 1) whether the data support your hypothesis, AND 2) whether they support the exact opposite of your hypothesis.
So, why do you test a hypothesis that you didn't actually have? As I mentioned, it's just a convention to run a two tailed test. You're just checking to see if your data are stranger than 95% of the "no effect" samples out there, and you don't care about the direction of weirdness. Let's say caffeine actually made the kids a lot less energetic...you'd still have a statistically significant effect.
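Here is a rough sketch of that "direction of weirdness" point. It uses a z statistic and the standard normal distribution as a stand-in for whatever test you'd actually run (an assumption made to keep the example dependency-free), and the function name p_values is just for illustration. Positive z means the caffeine group came out more energetic, which is the predicted direction.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_values(z):
    """Return (two_tailed, one_tailed) p-values for a z statistic.

    The one tailed value tests the directional prediction
    "treatment > control", so only large positive z counts as evidence.
    """
    two_tailed = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    one_tailed = 1 - STD_NORMAL.cdf(z)
    return two_tailed, one_tailed


# Same size of effect, once in the predicted direction and once in the opposite one.
for z in (2.5, -2.5):
    two, one = p_values(z)
    print(f"z = {z:+.1f}: two tailed p = {two:.3f}, one tailed p = {one:.3f}")
```

The two tailed p-value is identical for both signs, so a result that flatly contradicts the prediction still clears p < .05. The one tailed test only comes out significant when the effect lands on the predicted side.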
Now, I've always found this a bit odd. If you have a good theoretical model that you're working from, you probably have a directional prediction (X will increase Y), rather than a nondirectional prediction (X will increase or decrease Y). But you use a nondirectional test of the hypothesis.
I think a two tailed test can have two problems. First, it lets you pass off a statistically significant effect that you didn't predict as though you had predicted it. Let's say you predicted "X increases Y," you run a two tailed test, and X actually looks to have reduced Y. You would just have to come up with a story about how you actually expected that all along! Second, two tailed tests are less powerful than one tailed tests in your specified direction. That old stand-by, the two tailed test, gives you less power to see if you're right, and more power to find something you didn't predict in the first place. Neither one of those outcomes is particularly good if you have directional hypotheses.
So, if two tailed tests are bad (at least when you have a directional hypothesis), why are they standard practice? I think basically it's because 1) they are more conservative, and 2) one tailed tests seem kind of suspicious. I mean, how can you tell if a researcher actually predicted the right directional effect? Maybe they tried a two tailed test, and it wasn't quiiiite significant (say, p = .068), so they decided to then run a one tailed test, which would make the results significant (p = .034). If you can't be sure that a given researcher actually had a clear directional hypothesis, it's easy to question their use of one tailed tests.
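The arithmetic behind that switch is simple enough to sketch. For a test statistic with a symmetric null distribution (like t or z), the one tailed p-value is half the two tailed one when the effect falls on the predicted side, and one minus that half when it falls on the other side. The helper name below is hypothetical.

```python
def one_tailed_p(two_tailed_p, in_predicted_direction):
    """Convert a two tailed p-value to a one tailed one.

    Assumes a symmetric null distribution (e.g. t or z).
    """
    half = two_tailed_p / 2
    return half if in_predicted_direction else 1 - half


print(round(one_tailed_p(0.068, True), 3))   # effect in the predicted direction
print(round(one_tailed_p(0.068, False), 3))  # effect in the opposite direction
```

The halving only helps if the direction was fixed before looking at the data. Picking the tail afterwards is what makes the move suspicious.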
Preregistration: A Solution???
What if there was a way to clearly document that you had a directional hypothesis before you collected the data? That should put people's minds at ease, right?
Turns out, you can do exactly that. Services like the Open Science Framework let you preregister study methods, hypotheses, data analysis plans, and the like. Basically, you can create a time-stamped document that outlines your hypotheses and the state of the project. Then, when the project is done, you can make it publicly available. And everyone will get to see what your predictions were...before you saw the data.
Preregistration is also good for lots of other reasons. It's one way to guarantee transparency in research (a Good Thing). With a preregistration, there is no ambiguity about what authors actually predicted. There's no way for researchers to conveniently ignore all the measures and conditions that "didn't work" when they write up the final manuscript.
But, I've been wondering a bit about ways to incentivize preregistrations for individual researchers. After all, one can be a tremendously productive researcher by pairing opaque methods (p-hacking) with capitalizing on unexpected, yet explainable, results (HARKing: Hypothesizing After the Results are Known). Preregistrations might be great for science as a whole...but they will cut down on individual productivity.
But, if preregistrations make it more feasible for researchers to get away with one tailed tests, this could level the playing field a bit.
Let's say you're studying something with an effect size of .5 (Cohen's d). In order to obtain power of .8 at the usual alpha of .05, you'd need 128 participants with a two tailed test, but only about 100 with a one tailed test (between groups design). The upshot of this is that, if you have access to a finite number of participants, you'll be able to run roughly 28% more new studies while maintaining both power and Type I error rates. Win win win!
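If you want to check sample sizes like these yourself, here is a minimal sketch using the textbook normal approximation for a two-group comparison, n per group = 2 * ((z_alpha + z_power) / d)^2. This is an approximation I'm swapping in to avoid extra dependencies, not the exact t-based calculation that dedicated power software performs, so expect it to come out a participant or so per group lower (it gives 126 rather than 128 for the two tailed case). The function name and defaults are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05, tails=2):
    """Approximate participants needed per group (normal approximation)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / tails)  # critical value; alpha is split across tails
    z_power = z(power)              # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


total_two_tailed = 2 * n_per_group(0.5, tails=2)
total_one_tailed = 2 * n_per_group(0.5, tails=1)
print(total_two_tailed, total_one_tailed)
```

The only thing that changes between the two calls is the critical value: a one tailed test puts all of alpha on one side, so the bar is lower and fewer participants are needed to clear it.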
Or, another way to think about it is that if you're studying a medium effect (d = .5) and you only have 100 participants, you've got an 80% chance of significant results with a one tailed test, but only a 70% chance with a two tailed test. I'm not much of a gambler, but if you had to pick between a slot machine that pays out 70% of the time and one that pays out 80% of the time...you'd take 80% every time.
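The same comparison can be sketched from the other side: fix the sample at 50 per group and ask what power each test gets. As before, this uses a normal approximation rather than the exact noncentral t calculation, and the function name is made up for illustration, so treat the output as ballpark figures.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05, tails=2):
    """Approximate power of a two-group comparison (normal approximation)."""
    norm = NormalDist()
    shift = d * sqrt(n_per_group / 2)       # expected z when the true effect is d
    crit = norm.inv_cdf(1 - alpha / tails)  # critical value for the chosen test
    power = 1 - norm.cdf(crit - shift)      # rejections in the predicted direction
    if tails == 2:
        power += norm.cdf(-crit - shift)    # (tiny) chance of a wrong-direction rejection
    return power


power_one = approx_power(0.5, 50, tails=1)
power_two = approx_power(0.5, 50, tails=2)
print(f"one tailed: {power_one:.2f}, two tailed: {power_two:.2f}")
```

Under this approximation the one tailed test lands right around .80 and the two tailed test around .70, in line with the figures above.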
Finally, preregistration paired with a one tailed test might actually make a more compelling package of results.
Hypothetically, which of the following would you find most and least compelling in a paper?
1) p = .06, two tailed test ("marginally significant")
2) p = .03, one tailed test, no preregistration
3) p = .03, one tailed test, directional hypothesis preregistered
My initial intuition is that #2 is fishy...it's the least compelling. #1 is pretty "meh." Not a deal breaker, but not great either. #3 looks the strongest to me.
Feel free to weigh in in the comments. Do you get to run one tailed tests if you preregister your hypotheses? Is that enough of a "bonus" for researchers that it might nudge folks towards more preregistrations?