Thursday, October 25, 2012

Data-centric science - Correlation and causation

Yesterday I defined a model as a description that involves a cause-and-effect relationship between phenomena. In contrast, a data-centric approach to science looks only for correlations between data sets to answer scientific questions. This approach relies on very large data sets to reach accurate conclusions.

After thinking about yesterday's post I realized that my arguments relate to the common admonition "correlation does not imply causation." The fallacy of inferring causation from correlation is most often committed when complex systems of many interconnected parts are involved, as in human health. Statements like "taking vitamin C tablets will cause me to not get sick" and "eating vegetables prevents me from getting cancer" are statements about cause and effect. As we have been taught again and again, though, taking vitamin C tablets may only decrease the chances that I get sick.
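
As a rough sketch of the vitamin C example, here is a short Python simulation. The probabilities of getting sick (30% without tablets, 25% with) and the function name sick_rate are made up purely for illustration; they are assumptions, not measured values.

```python
import random


def sick_rate(p_sick, n_people, rng):
    """Fraction of n_people who get sick when each has probability p_sick."""
    sick = sum(1 for _ in range(n_people) if rng.random() < p_sick)
    return sick / n_people


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical probabilities, chosen only for illustration.
P_SICK_WITHOUT = 0.30
P_SICK_WITH = 0.25

# Compare a small sample with a very large one.
for n in (20, 100_000):
    without = sick_rate(P_SICK_WITHOUT, n, rng)
    with_c = sick_rate(P_SICK_WITH, n, rng)
    print(f"n={n:>7}: sick without tablets {without:.3f}, with tablets {with_c:.3f}")
```

No single simulated person is guaranteed to stay healthy by taking the tablets; only the rates differ. With 20 people per group the two rates are noisy and can come out in either order, while with 100,000 they settle near the assumed probabilities, which is why the data-centric approach leans on very large data sets.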

So here is the dichotomy that I was looking for: model-based science is useful in simple systems for which I may make cause-and-effect statements. Data-centric science is more useful for complex, coupled systems for which causality is a poor descriptor.

This is certainly a new way of thinking. Depending on the complexity of what we are observing, we should either employ or abandon causality as a means of interpretation.