Causality and Prediction
Philosophy of Data Science Part II
When I entered data science, I experienced a lot of confusion that I couldn’t even articulate, let alone resolve. In hindsight, most of that confusion fell into what I’d call model selection - deciding what mathematical model to use to represent the real world. While Part I of this series focused on bringing more nuance to the idea that “data-driven = better”, this article (Part II) will dive into two different modeling approaches. The first reigns supreme in computer science/machine learning: building models that are empirically predictive on representative datasets, without trying to understand the underlying data-generating mechanism. The second approach is common in economics: seeking accurate explanations of phenomena by breaking down the causal relationships between variables. In my experience as a student in both fields, people in each modeling camp don’t have much exposure to the other side, so I hope that by expanding on each approach in this article, members of both camps can draw out some broader lessons about how to make more predictive, and thus more useful, models.
In my undergrad, I was immersed in the machine learning world, where the focus is very explicitly on making good predictions as opposed to modeling the underlying mechanism. The field’s general approach to making predictive models is pretty intuitive: choose the model that is, in some sense, the most empirically predictive on existing instances of a problem, and use that model on future instances of the problem. However, there were recurring scenarios I ran into on the job that really confused me. Here are a few representative ones:
Translating the coefficients of a machine learning model into interpretable business strategy recommendations. How much could those coefficients/recommendations be trusted?
Choosing which variables to use as inputs to a model that predicts when the Bank of Canada will change interest rates. With the limited knowledge of economic theory I had at the time, this was a very ad hoc process of picking variables that seemed vaguely relevant and using SHAP values or “feature importance” scores to decide which variables to include. But what was an appropriate threshold for deciding whether a variable was “important” enough to actually be related to interest rates?
Simpson’s paradox, or the phenomenon where a university could have drastically higher overall admission rates for men than women, but lower admission rates for men than women in every single subject. If both can be simultaneously true, how meaningful is either statistic as evidence for gender bias?
Trying to connect any of the theory I learned in my statistics classes to the professional data science work I was doing (I may be exaggerating, but only slightly!).
I started to get some clarity during my master’s, when I encountered the Rubin potential outcomes framework, which explicitly models an individual’s outcome in different counterfactual scenarios. That way you can talk about, for example, someone’s health in a scenario where they go to the hospital vs. a scenario in which they don’t, and thus isolate the causal effect of going to the hospital on their health.
More concretely, each patient i might have “health” Y1(i) if they go to the hospital, and Y0(i) if they stay home. The causal effect of going to the hospital on patient i’s health is defined as Y1(i) - Y0(i). Of course, you only actually observe one of those options for an individual at a given point in time, so you can’t actually know the causal effect. But if you have n people at the hospital and m people who stayed home, and you have reason to believe that the ones who went to the hospital were drawn from the same distribution as the ones who didn’t, then you can estimate the Average Causal Effect, more commonly known as the Average Treatment Effect (ATE), via:
ATE ≈ (1/n) Σ Y1(i) - (1/m) Σ Y0(j), i.e. the average observed outcome of the n hospital patients minus the average observed outcome of the m patients who stayed home.
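To make that formula concrete, here’s a minimal Python sketch of the difference-in-means calculation for a hypothetical randomized hospital experiment (all variable names and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical randomized experiment: each person is assigned to the
# "hospital" (treated) or "home" (untreated) group by a coin flip, so the
# two groups are drawn from the same underlying population.
n_people = 10_000
baseline_health = rng.normal(loc=50, scale=10, size=n_people)
treated = rng.random(n_people) < 0.5          # random assignment

true_effect = 5.0                             # assumed benefit of going to the hospital
outcome = baseline_health + true_effect * treated + rng.normal(0, 5, n_people)

# Difference-in-means estimate of the Average Treatment Effect:
# average observed outcome of the treated minus that of the untreated.
ate_hat = outcome[treated].mean() - outcome[~treated].mean()
print(f"Estimated ATE: {ate_hat:.2f} (true effect: {true_effect})")
```

Because assignment was random, the estimate lands close to the true effect of 5.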
In practice, why might you have reason to think the people at the hospital came from the same distribution as the ones at home? Maybe they were both part of your experiment population and you randomly assigned them to go to the hospital or not. Or maybe you know enough about how the data was generated to believe one of the assumptions that underpin other tools in the causal inference literature, like difference-in-differences or regression discontinuity.
The reason the potential outcomes framework didn’t apply to my previous work as a data scientist is that I had only ever been analyzing data that was not the result of a meticulous experiment - it was observational data. Here’s what could go wrong with treating observational data like it came from an experiment: if you were just given a real-world dataset of health outcomes for people at the hospital and people at home, and tried to compute an average treatment effect using the formula above, you’d erroneously conclude that hospitals make people sicker. But you’d have ignored the fact that people going to the hospital would probably have been sick even if they hadn’t gone to the hospital!
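Here’s a small simulation of that failure mode, under the made-up assumption that sicker people are far more likely to go to the hospital even though the hospital genuinely helps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observational data: the sicker you are, the more likely you
# are to go to the hospital, so the two groups come from different
# distributions (baseline health confounds the comparison).
n_people = 10_000
baseline_health = rng.normal(loc=50, scale=10, size=n_people)
p_hospital = 1 / (1 + np.exp(0.3 * (baseline_health - 50)))  # lower health -> more likely to go
goes_to_hospital = rng.random(n_people) < p_hospital

true_effect = 5.0                             # the hospital still genuinely improves health
outcome = baseline_health + true_effect * goes_to_hospital + rng.normal(0, 5, n_people)

# Naively applying the difference-in-means formula to observational data:
naive_ate = outcome[goes_to_hospital].mean() - outcome[~goes_to_hospital].mean()
print(f"Naive ATE estimate: {naive_ate:.2f} (true effect: {true_effect})")
```

The naive estimate comes out negative - hospitals appear to make people sicker - even though the true effect is positive, purely because sicker people self-select into going.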
Seeing causation defined made something click for me. I had often been warned that “correlation is not causation”, but never previously been given a definition of what causation is. And this is important, because any time you try to explain why some phenomenon occurred, you’re invariably making a causal claim. Because I couldn’t have reasonably arrived at an average treatment effect in any of my past work, I hadn’t been able to make sound causal claims, and my efforts to arrive at valid explanations had been somewhat in vain.
At this point, you might respond that this is simply too strict a definition of a valid explanation and that interpreting a predictive model causally is a decent approximation - besides, as data scientists we’re in the business of making useful approximations. I think that’s fine, but it’s important to be aware of the limits of your approximations. The problem with relying on a predictive model to explain a phenomenon is that you can have multiple equally predictive models with completely different feature weights. If you had landed on a different final predictive model, your explanation might have been completely different.
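As a toy illustration (with hypothetical data), suppose two input features carry exactly the same information: models with completely different coefficients then make identical predictions, so their coefficients can’t all be read as the explanation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with two perfectly correlated features.
n = 1_000
x1 = rng.normal(size=n)
x2 = x1.copy()                                # x2 duplicates x1 exactly
y = 2.0 * x1 + rng.normal(0, 0.1, n)

# Three models that split the weight between x1 and x2 differently.
candidate_weights = {"A": (2.0, 0.0), "B": (0.0, 2.0), "C": (1.0, 1.0)}
for name, (w1, w2) in candidate_weights.items():
    pred = w1 * x1 + w2 * x2
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    print(f"Model {name}: weights = ({w1}, {w2}), RMSE = {rmse:.3f}")

# All three models make identical predictions, yet reading their
# coefficients causally would tell three very different stories.
```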
Coming back to the examples from the beginning, the common theme was that they required attention not only to empirical predictive accuracy but also to the underlying mechanism. In more detail:
(Business recommendations from model coefficients) I was trying to get causal interpretations from a predictive model. For the reasons just described, this approach leads to explanations that may or may not be correct, though they might still be useful approximations.
(Feature importance & predicting interest rates) Asking whether a variable’s feature importance score made it important enough to actually be related to interest rates was mixing up a statement about the predictive model (feature importance) with a statement about the underlying mechanism (the true relationship between the variable and interest rates). Feature importance and SHAP values say nothing about whether the variable is actually related to interest rates, only about the variable’s contribution to that specific model’s predictions. This example also highlights that even in a machine learning setting, you have to take at least a weak stance on the underlying mechanism, if only to choose your input variables. The best practice would have been to choose variables based on how their inclusion affected the model’s predictive performance (which was not great, by the way - I would not recommend trying to use ML to predict Bank of Canada rate changes unless you get utility from futility), but with limited training data and infinitely many possible variables and combinations, some assumptions must be made.
(Simpson’s paradox) Without some causal framework, it’s hard to make sense of Simpson’s paradox. A circumstance similar to the university admissions example actually arose at UC Berkeley in 1973, the confounding variable being that women disproportionately applied to more competitive programs (see the sketch after this list for a made-up dataset that reproduces the reversal).
(Connecting statistical theory to work) The statistics you first learn about in school (point estimates, confidence intervals, hypothesis tests) were literally invented for use in experiments. In the early 20th century, Ronald Fisher revolutionized agricultural production by moving away from collecting observational data towards conducting agricultural experiments, and he invented many of the basic tools we use today to analyze results. No wonder I had a difficult time connecting my work with observational data to my education.
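To make the Simpson’s paradox point concrete, here’s a made-up admissions dataset (not the actual 1973 Berkeley figures) that reproduces the reversal:

```python
import pandas as pd

# Hypothetical admissions data: women disproportionately apply to the more
# competitive ("hard") program.
df = pd.DataFrame({
    "program":  ["easy", "easy", "hard", "hard"],
    "gender":   ["men", "women", "men", "women"],
    "applied":  [800, 200, 200, 800],
    "admitted": [500, 130, 20, 90],
})

# Overall admission rates: men ~52%, women ~22%.
overall = df.groupby("gender")[["admitted", "applied"]].sum()
print(overall["admitted"] / overall["applied"])

# Per-program admission rates: women are admitted at a higher rate in both.
by_program = df.set_index(["program", "gender"])
print(by_program["admitted"] / by_program["applied"])
```

Both statistics come from the same table; which one speaks to gender bias depends on the causal story you’re willing to assume about how applicants choose programs.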
So Which Approach Is Better?
Given its crucial role in explaining phenomena, I find it a bit surprising that I hadn’t seen a definition of causality earlier. However, isolating causal effects is often difficult, if not impossible, and combining knowledge of causal relationships to arrive at predictions is not straightforward. And after all, as the ML folks might argue, how valuable is an explanatory model really if it isn’t good at making predictions?
Economists, on the other hand, would zero in on the main assumption underlying most machine learning approaches: that the future will be similar to the past. Sometimes, in fact, you know that’s not a good assumption (maybe you’re a government implementing a completely new policy), and if you can build into your model the ways the future will be different, then you can make better predictions.
In terms of which approach actually leads to more predictive models in practice, it might seem like prediction is winning big-time. Natural language processing is a great example: recent advances in predictive modeling have resulted in far more capable models (e.g. ChatGPT) than efforts to explicitly model the mechanism that produces human language. Computer vision and recommendation systems are other fields in which the predictive approach has excelled, with phenomenal breakthroughs in recent years.
But explanatory models absolutely have their place. Take quantum electrodynamics, for example. Judged by the accuracy of its predictions, it’s arguably the most accurate scientific theory ever developed. Granted, the words we use to explain quantum physics are a loose interpretation of mathematical theories that it’s a stretch to say we “understand”. But it is qualitatively very different from ChatGPT. The inner workings of ChatGPT can essentially be understood with just high-school math and matrices (its complexity comes from the sheer number of parameters required to describe the model in full), and it arose from some very clever optimization techniques and a lot of computing power. Quantum physics, on the other hand, is built out of concise but very complex mathematical machinery and presumably arose from scientists accumulating causal relationships. In that sense, it’s the result of efforts to explain, and its success demonstrates that there’s a lot of value in theories that capture the essence of core ideas and communicate them effectively to other scientists, even if they’re not as immediately useful for making predictions. It’s possible that the iterative modeling work of a group of scientists over time can eventually result in more predictive models than brute-force optimization - after all, isn’t this how science has progressed for most of the past 300 years?
In the end, if you’re trying to build models that predict well, efforts to predict and efforts to explain both have their place. If you have any thoughts about what characterizes the problems where one approach is more successful than the other, I’d be curious to hear them! Ultimately, focusing solely on prediction and focusing solely on the underlying mechanism are two theoretical extremes, with practical situations requiring some attention to both. That said, I’ve found that just becoming explicitly aware of both approaches has helped me structure my thinking about modeling and truth, resolving another piece of the puzzle that is applied statistics.
Thanks to Thomas Nguyen for feedback on this article. Stay tuned for more posts getting at more pieces of this puzzle!
References and Further Reading
Machine Learning: An Applied Econometric Approach, an article by Mullainathan & Spiess (link)
Contains a very clear description of the predictive modeling approach
When Do We Actually Need Causal Inference?, a talk by Sean Taylor (link) - thanks to Branko Boscovic for sending this my way
Distills a lot of data science work down to a single formula
Presents the compelling perspective that a useful model should tell you what to do, and that to evaluate the effect of different possible actions you need to model the underlying mechanism
The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, a book by David Salsburg (link)
A fascinating and accessible book about the development of statistics over the twentieth century, with more high-level ideas than dense math
Causal Inference: The Mixtape, a textbook by Scott Cunningham (link)
A textbook for diving more deeply into causal inference
To Explain or to Predict?, an article by Galit Shmueli (link)
An in-depth comparison of the causal and predictive modeling approaches
Statistical Modeling: The Two Cultures, an article by Leo Breiman (link)
An influential paper in statistics advocating for a move away from focusing on mechanisms towards focusing on black-box prediction

