Wednesday, October 24, 2012

Data-centric science - What is a model?

Chris Anderson's article "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" suggests that models may no longer be necessary for solving scientific problems because of the large amount of data now stored in databases across the globe. Rather, looking for correlations between events may be enough to solve these problems.

I'm going to assume that the article's title is an overstatement; not all scientific problems can be solved with a data-centric approach. Some are very well suited to this method, however. To distinguish between these types of problems, I think it's necessary to first address the question "what is a model?" After this is answered, I hope to address why models may sometimes be circumvented.

Wikipedia's disambiguation page for the word "model" is quite long. The word can mean many things within a scientific context. However, several words appear repeatedly on this page and the pages it links to: description, simulation, representation, framework. More informative (albeit more complicated) is the explanation found at the Stanford Encyclopedia of Philosophy. The central question of this post is addressed there in Section 2: Ontology. A model may be a physical or fictional object, a description, an equation, or a number of other things.

Based on this information I think it's reasonable to state that a model is an attempt at replicating the behavior of some phenomenon, whether physically or as a result of an application of logical rules. I think further that a model establishes cause-and-effect relationships to do this. For example, Newton's theory of gravity contained the idea that something (gravity) caused the apple to fall. As another example, energy input from the ocean causes (among other things) hurricanes in weather models.

Models satisfy some human desire for causality. I read once (though I don't remember where) that people use reasoning as a coping mechanism for emotionally difficult situations, such as when a loved one dies. Somehow, finding a reason or a cause for things provides us some degree of comfort.

The data-centric approach to scientific problem solving obviates the need to establish a cause-and-effect relationship. Insurance companies don't need to know why married, twenty-something men get in fewer car wrecks than their single counterparts in order to charge them less. Instead, they only need to know whether this is true.

But other than to make ourselves feel good, why would we need to find a cause-and-effect relationship in the first place? I think that this could be because the ability to make correct predictions is an important part of any model. We make predictions when we're unable to carry out an experiment easily or when we don't yet have enough data to answer a question. It is my suspicion that cause-and-effect relationships are central to a model's ability to predict, though I'm not sure how right now.

So, in summary, a model is a physical or mental construct meant to replicate the behavior of some phenomenon or system. I believe that the main difference between a model-based approach to science and a data-centric approach is that a model-based approach creates a causal chain of events that describes an observation. I don't necessarily see this chain ever ending. Once we determine a cause, we might wonder what caused the cause. And what caused the cause that caused the cause? At some point, data-centric science responds with "Enough! Just give me plenty of data and I will tell you if two events are correlated." That's all we can really hope for, anyway.

Notes: The never-ending chain of causes sounds very similar to Pirsig's never-ending chain of hypotheses in Zen and the Art of Motorcycle Maintenance. Is there a connection?

Also, I remember E. T. Jaynes arguing in Probability Theory: The Logic of Science that we can't really know an event will occur with 100% probability. This seems to suggest that cause-and-effect relationships do not really exist. Otherwise, we would always know the outcome of some cause. And if they don't really exist but are actually good approximations, then models really are what we've been told since middle-school science: imperfect and intrinsically human attempts at describing the world.

Note, October 25, 2012: I wrote this post late last night after having had a beer with dinner, so my mind wasn't as clear as it normally is when I write these posts. I realized this morning that the reason for building cause-and-effect relationships is that we can control a phenomenon if we know its proper cause. Many things are correlated, but fewer things are linked by a causal relationship. Therefore, accurate models give us the ability to control the outcome of an experiment, not just predict it. I don't believe that correlative analytics necessarily allow us to do this.
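
To make the difference between predicting and controlling a bit more concrete, here is a minimal Python sketch of a toy situation I made up for illustration. It is not taken from Anderson's article or any of the other sources above. A hidden variable z drives both x and y, so x and y are strongly correlated in observed data even though x never appears in the rule that generates y. The names pearson, simulate, and the intervene flag are my own hypothetical choices, and the noise levels are arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, seed=0):
    """Return the x-y correlation in a toy world with a hidden common cause z.

    If intervene is False, x follows z as it would in passively observed data.
    If intervene is True, we set x ourselves, independently of z.
    In both cases y depends on z only, never on x.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden common cause (assumed, toy value)
        if intervene:
            x = rng.gauss(0, 1)  # x chosen by the experimenter
        else:
            x = z + rng.gauss(0, 0.5)  # x driven by z plus noise
        y = z + rng.gauss(0, 0.5)  # y driven by z plus noise
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


observed = simulate(10_000, intervene=False)
controlled = simulate(10_000, intervene=True)
print(f"correlation in observed data:      {observed:.2f}")
print(f"correlation when we set x by hand: {controlled:.2f}")
```

In this toy setup the observed correlation is strong, so knowing x is enough to predict y reasonably well. Once we set x ourselves, though, the correlation all but disappears, because x was never a cause of y. Correlation was enough for prediction here, but it would have misled us about how to control y.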