Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed. The science of why things occur is called etiology. Causal inference is an example of causal reasoning.
In statistics, causation is a bit tricky. As you’ve no doubt heard, correlation doesn’t necessarily imply causation. An association or correlation between variables simply indicates that the values vary together. It does not necessarily suggest that changes in one variable cause changes in the other variable. Proving causality can be difficult.
Relationships and Correlation
The expression is, “correlation does not imply causation.” Consequently, you might think that it applies to things like Pearson’s correlation coefficient. And it does apply to that statistic. However, we’re really talking about relationships between variables in a broader context. Pearson’s coefficient measures the linear relationship between two continuous variables, but a relationship can involve other types of variables, such as categorical variables, counts, binary data, and so on.
For example, in a medical experiment, you might have a categorical variable that defines which treatment group subjects belong to: a control group, a placebo group, and several different treatment groups. If the health outcome is a continuous variable, you can assess the differences between group means. If the means differ by group, then you can say that mean health outcomes depend on the treatment group. There’s a correlation, or relationship, between the type of treatment and health outcome. Or maybe we have the treatment groups and the outcome is binary, say infected and not infected. In that case, we’d compare the proportion of infected subjects in each group to determine whether treatment correlates with infection rates.
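To make those two comparisons concrete, here is a minimal Python sketch. All the numbers are made up for illustration, and the group names and variable names (`outcomes`, `infected`) are assumptions rather than data from a real study.

```python
from statistics import mean

# Hypothetical continuous health outcomes per treatment group (made-up values).
outcomes = {
    "control":   [5.1, 4.8, 5.3, 5.0],
    "placebo":   [5.2, 5.0, 5.1, 4.9],
    "treatment": [6.1, 6.4, 5.9, 6.2],
}

# Continuous outcome: compare the mean outcome of each group.
group_means = {group: mean(values) for group, values in outcomes.items()}

# Hypothetical binary outcome: (number infected, group size) per group.
infected = {"control": (30, 100), "treatment": (12, 100)}

# Binary outcome: compare the proportion infected in each group.
infection_rates = {group: k / n for group, (k, n) in infected.items()}

for group, m in group_means.items():
    print(f"{group}: mean outcome = {m:.2f}")
for group, rate in infection_rates.items():
    print(f"{group}: infection rate = {rate:.0%}")
```

Differences like these describe a relationship in the sample. On their own they do not say whether the difference holds in the population or whether the treatment caused it.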
Throughout this post, I’ll refer to correlation and relationships in this broader sense: not just literal correlation coefficients, but any relationship between variables, such as differences between group means and proportions, regression coefficients, associations between pairs of categorical variables, and so on.
Causation and Hypothesis Tests
Before moving on to determining whether a relationship is causal, let’s take a moment to reflect on why statistically significant hypothesis test results do not signify causation.
Hypothesis tests are inferential procedures. They allow you to use relatively small samples to draw conclusions about entire populations. For the topic of causation, we need to understand what statistical significance means.
When you see a relationship in sample data, whether it is a correlation coefficient, a difference between group means, or a regression coefficient, hypothesis tests help you determine whether your sample provides sufficient evidence to conclude that the relationship exists in the population. You can see it in your sample, but you need to know whether it exists in the population. It’s possible that random sampling error (i.e., luck of the draw) produced the “relationship” in your sample.
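Statistical significance indicates that you have sufficient evidence to conclude that the relationship you observe in the sample also exists in the population. By itself, it says nothing about why the relationship exists, so a significant result is not evidence that one variable causes the other.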
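One way to see what a hypothesis test does with “luck of the draw” is a permutation test. The sketch below is an illustration under assumed, made-up data, not a method prescribed by this post. It shuffles the group labels many times and checks how often chance alone produces a difference in means at least as large as the observed one. The function name `permutation_p_value` and its parameters are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled data.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up samples for two groups.
group_a = [5.1, 4.8, 5.3, 5.0, 5.2, 4.9]
group_b = [6.1, 6.4, 5.9, 6.2, 6.0, 6.3]

p = permutation_p_value(group_a, group_b)
print(f"estimated p-value: {p:.4f}")
```

A small p-value here says that random assignment of labels rarely produces a gap this large. It does not say what produced the gap.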
Hill’s Criteria of Causation
Determining whether a causal relationship exists requires far more in-depth subject area knowledge and contextual information than you can include in a hypothesis test. In 1965, Austin Bradford Hill, a medical statistician, tackled this question in a paper that has become a standard reference. While he introduced it in the context of epidemiological research, you can apply the ideas to other fields.
Hill describes nine criteria to help establish causal connections. The goal is to satisfy as many criteria as possible. No single criterion is sufficient. However, it’s often impossible to meet all the criteria. These criteria are an exercise in critical thought. They show you how to think about determining causation and highlight essential qualities to consider.
Correlation Does Not Mean Causation
Even if there is a correlation between two variables, we cannot conclude that one variable causes a change in the other. This relationship could be coincidental, or a third factor may be causing both variables to change.
For example, Ankit collected data on the sales of ice cream cones and air conditioners in his hometown. He found that when ice cream sales were low, air conditioner sales tended to be low and that when ice cream sales were high, air conditioner sales tended to be high.
- Ankit can conclude that sales of ice cream cones and air conditioners are positively correlated.
- Ankit can’t conclude that selling more ice cream cones causes more air conditioners to be sold. It is likely that the increases in the sales of both ice cream cones and air conditioners are caused by a third factor, an increase in temperature!
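The ice cream example can be simulated. The sketch below assumes a simple, made-up model in which temperature drives both sales figures and neither product affects the other. The coefficients, noise levels, and helper names (`pearson`, `residuals`) are all illustrative assumptions. It compares the raw correlation with the correlation that remains after removing the linear effect of temperature from both series.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(42)

# Assumed model: temperature is the only common driver of both sales figures.
temperature = [rng.uniform(10, 40) for _ in range(2000)]
ice_cream = [20 * t + rng.gauss(0, 50) for t in temperature]
air_con = [3 * t + rng.gauss(0, 10) for t in temperature]

# Raw correlation between the two sales series.
raw_r = pearson(ice_cream, air_con)

# Correlation after removing the linear effect of temperature from both.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(air_con, temperature))

print(f"raw correlation:                 {raw_r:.2f}")
print(f"after adjusting for temperature: {partial_r:.2f}")
```

In this simulated setup, the raw correlation is strong, yet it largely disappears once temperature is accounted for. That is the pattern you would expect when a third factor drives both variables.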