An experiment, also known as an A/B test or randomized controlled trial, is a research method used to answer questions of cause and effect: does some factor “X” affect outcome “Y”?

Imagine you’ve noticed that when you’re wearing a hat, people tend to compliment you. You may think the hat “causes” the compliments, but people may also be complimenting you when you’re not wearing a hat and you just don’t notice. Or it could be that you only wear a hat when you’re with a certain group of friends who compliment more in general (a confounding variable). Or maybe you only wear a hat when it’s sunny, and people are in a better mood when it’s sunny.

An experiment takes care of these issues by providing a comparison or “**control**” condition (days when you’re not wearing a hat) and “**randomizing**” the thing of interest (hat vs. no hat). Let a coin flip decide whether you wear a hat on a given day, and if you still get more compliments on hat days, the hat is likely why, because now it isn’t the sun or your friends deciding when you wear the hat. It’s the coin.
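The coin flip is all the machinery randomization needs. Here is a rough sketch in Python (the 14-day window is just an illustrative choice):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Flip a coin for each of the next 14 days: heads = hat, tails = no hat
assignment = {f"day {i + 1}": ("hat" if random.random() < 0.5 else "no hat")
              for i in range(14)}
```

Because the coin, not the weather or your social plans, picks the hat days, anything that differs between the two groups of days is attributable to the hat (plus chance).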

The experiment process is pretty simple once you understand the two basic ingredients of control and random assignment. So we’ve broken it down into 7 basic steps.

1. Start with a question: Does some factor “X” cause outcome “Y”?

2. Choose how you will measure outcome “Y”.

3. Decide how you will present your factor “X” to people.

4. Make a control condition identical to #3 above, but without factor “X”.

5. Randomize your causal factor “X”.

6. Launch your experiment and collect your outcome data.

7. Analyze your results with the appropriate statistical test, comparing the averages.

It’s that simple. Experiments can take on more complex designs, such as multiple “X”s or outcome measures, but those are a bit beyond our scope here. Nevertheless, there are a few other important considerations you should pay attention to if you want to design a quality experiment.

The number of people who will participate in your experiment is very important, as this affects how confident you can be that your experiment results are not just due to chance. The more people who participate in your experiment, the more likely you’d find similar results if you ran the experiment again.

Consider flipping a coin. If you do it only 4 times, you may end up with 75% or even 100% of one side winning. But if you flip a coin 1,000 times, the outcome is much more likely to be around 50-50.
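You can see this behavior directly with a quick simulation:

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

def heads_share(n_flips):
    """Fraction of heads in n_flips fair coin flips."""
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

few = heads_share(4)       # with only 4 flips, 0.75 or even 1.0 is common
many = heads_share(1000)   # with 1,000 flips, the result hugs 0.5
```

The same logic applies to participants: with more people, chance imbalances between your conditions wash out.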

So how many people should take your experiment? Here are some general guidelines:

- 60 people is usually enough to detect large effects between two conditions (e.g., a 1-point difference on a 1-5 survey scale).
- 120 people is usually enough to detect medium effects (e.g., about 2/3 of a point difference on a 1-5 scale).
- 200-400 people or more are generally required for small effects (e.g., about 200 people for a 1/2 point difference, about 450 people for a 1/3 point difference).

There are a variety of free sample size calculators (e.g., GPower) that can help you determine a reasonable sample size, given your required level of confidence in the results and the estimated effect size.

Hand-in-hand with sample size is the effect size that your causal factor has on the outcome. This is basically the average difference in your outcome between your control condition and your causal factor condition.

For example, if survey takers rate your photo with a hat on average at about 4.0 on a 1-5 scale, and the average rating of your photo without a hat is 3.0, your effect size is 1.0.

How do you know whether that’s large, medium, or small? We can compute a standardized effect size, which makes our result more comparable to other experiments.

For example, the formula for Cohen’s d is simply the difference in average outcomes between your two conditions, divided by the sample’s standard deviation (a measure of how spread out your data is). In our hat example, if your sample’s standard deviation is 1.25, simply take 1.0 divided by 1.25, and you have a standardized effect size of 0.80, which is Cohen’s threshold for a large effect.
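The arithmetic from the hat example looks like this (using the single sample standard deviation from the text; formal definitions of Cohen’s d use a pooled SD across the two groups):

```python
def cohens_d(mean_treatment, mean_control, sd):
    """Cohen's d: difference in mean outcomes divided by the standard deviation."""
    return (mean_treatment - mean_control) / sd

# Hat example: mean rating 4.0 with a hat, 3.0 without, SD of 1.25
d = cohens_d(4.0, 3.0, 1.25)  # 0.8, Cohen's threshold for a large effect
```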

But wait. If we’ve never run the experiment before, how would we know the effect size? You may not. And that’s ok. You can provide your best guess, or decide what effect size is the smallest you’d be comfortable with and then plan your sample size accordingly.

For example, if you don’t care about any effect of wearing a hat that’s smaller than a 1/2 point difference on a 1-5 scale, a sample size calculator would tell you that recruiting 200 people would give you an 80% chance of finding the 1/2 point effect, if it actually exists. If you run the experiment and you don’t find a significant result, it’s likely because the difference is smaller than half a point, maybe even closer to zero.
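If you’d rather see the mechanics than trust a calculator, here is a back-of-the-envelope version using the normal approximation (dedicated tools like GPower use the exact t-distribution, so their answers will differ slightly):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate people needed per condition for a two-sided test,
    using the normal approximation to the two-sample t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# A 1/2-point difference on a 1-5 scale with an SD of 1.25 is d = 0.4
n = n_per_group(0.5 / 1.25)  # roughly 100 per group, about 200 people in total
```

Note how the sample size grows as the effect size shrinks: halving the effect you want to detect roughly quadruples the people you need.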

If you hate math, you’ll love experiments. Well, at least relative to other research methods. Unlike big data, experiments rely on only a handful of basic statistics, all of which can be computed using free online software.

- Sample Size (number of people in your experiment)
- Mean (average) outcome in each condition
- Standard Deviation (how spread-out your outcome data is)

These numbers can be plugged into a t-test calculator, which will analyze the difference in averages between your control and treatment groups. The test will spit out a probability value (“p-value”), which tells us the probability of finding results at least this extreme if the real difference were zero. The smaller the p-value, the better. If the p-value is below 0.05, we call the result “statistically significant”: if the real difference were actually zero, there would be less than a 5% chance of seeing a difference this large by luck alone.
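For instance, here is a sketch of the calculation such a calculator performs, using made-up summary numbers for the hat example (the 100 people per condition is an assumption; this uses the normal approximation to the t-distribution, which is reasonable at these sample sizes):

```python
from math import sqrt
from statistics import NormalDist

def difference_test(mean1, sd1, n1, mean2, sd2, n2):
    """Compare two group means from summary statistics.
    Uses the normal approximation to the two-sample t-test, which is
    reasonable once each group has a few dozen people."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se                  # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return z, p

# Hat condition: mean 4.0, SD 1.25; no-hat condition: mean 3.0, SD 1.25
stat, p_value = difference_test(4.0, 1.25, 100, 3.0, 1.25, 100)
significant = p_value < 0.05  # a 1-point gap is very unlikely by chance here
```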

From thinking of your cause-effect question to analyzing your results, you now have the basic tools to design an experiment. Of course, there are other considerations to truly master the art and science of experiments. But mastery isn’t required to benefit from this research tool. Following the steps above will put you well on your way to gaining valuable insight into cause-effect relationships you care about.

Curious to see an experiment in action? Check out the article below!