Wednesday, November 9, 2011

My relationship with curve fitting

My understanding of curve fitting has changed a lot since I took statistics in high school. Back then it was simply an exercise that produced a line through trivial data that my classmates and I had collected. The function for the line could predict the outcome of future experiments, and therein lay its usefulness.

In college, its importance increased when I learned how to extract physically meaningful quantities from the fitting parameters. The fit was a tool for extracting information from the noise of experimental randomness. It became more complex as well: the types of models with which I could fit the data grew far beyond simple lines. Now models included Gaussians, decaying exponentials, and many other transcendental functions. The importance of curve fitting at this point in my education lay beyond simple prediction; it revealed to me the reality that lay behind the noise, and this reality was encoded in the values of the fit parameters. Curve fitting had become absolute and always revealed the true physics behind some process.

Now, after four and a half years of graduate school, I've learned that the human element of curve fitting is paramount. I no longer see it as the purely objective tool that I did before I received my B.S. The moment of change occurred when I realized that the results of a fit can be undermined simply by adding too many parameters to the model (cf. this post from Dr. Ross McKenzie, where he noted a paper in Nature that contained a fit of a model containing 17 parameters to 20-plus data points). If one can fit an elephant to data using only five parameters, then clearly any other model, including one that a scientist is arguing for in a paper, can be made to "explain" data if it possesses enough free parameters. Furthermore, the initial values for the fitting procedure can change the outcome, since the routine may settle on a local minimum in the solution space. Therefore, an educated guess made by an informed human is a critical element of any curve fitting routine.
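The point about initial values can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than data from any real experiment: a noisy unit-amplitude sinusoid at 1.5 Hz, a one-parameter model in which only the frequency is free, and SciPy's `least_squares` as the fitting routine. The same data and model are fit twice, once starting near the true frequency and once starting far from it.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic data (assumed for illustration): unit-amplitude sinusoid at 1.5 Hz plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
true_freq = 1.5
y = np.sin(2 * np.pi * true_freq * x) + rng.normal(0.0, 0.1, x.size)


def residuals(params, x, y):
    """Difference between the one-parameter sinusoid model and the data."""
    freq = params[0]
    return np.sin(2 * np.pi * freq * x) - y


# Same data, same model; only the starting guess for the frequency differs.
fit_good = least_squares(residuals, x0=[1.48], args=(x, y))
fit_bad = least_squares(residuals, x0=[1.0], args=(x, y))

# least_squares reports cost as half the sum of squared residuals.
freq_good, sse_good = fit_good.x[0], 2 * fit_good.cost
freq_bad, sse_bad = fit_bad.x[0], 2 * fit_bad.cost

print(f"start 1.48 -> freq = {freq_good:.3f}, SSE = {sse_good:.1f}")
print(f"start 1.00 -> freq = {freq_bad:.3f}, SSE = {sse_bad:.1f}")
```

In this setup the sum of squared residuals has many narrow local minima along the frequency axis, so the poorly chosen start should stall in one of them with a far larger residual, while the informed start should recover a frequency close to 1.5.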

My experiences with curve fitting in graduate school have completely transformed my opinion of its value. It certainly no longer appears to me as an absolute tool. I'm also much more careful when assessing conclusions in papers that employ some sort of regression since I've personally experienced many of its pitfalls.

I think it's incredibly important to make undergraduates aware that curve fitting goes beyond a simple exercise of plugging data into a computer and clicking "Go." Both intuition about the physics that generated the data and the ability to make objective judgements about the value of a model are crucial to drawing sound conclusions. How much do the parameters vary with the range of data included in the fit? Do the parameters represent physical quantities, or are they simply used to facilitate further calculations? What is the degree of confidence in the fit parameters? Are there too many free parameters in the model? Is the original data logarithmic, and, if so, was the fit performed on a logarithmic or linear scale? All of these questions and more should be addressed before presenting results based on a fitting procedure.
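Two of these questions, the confidence in the fit parameters and the choice between a linear and a logarithmic scale, can be made concrete with a short sketch. The data are synthetic and assumed purely for illustration: an exponential decay with amplitude 3.0, rate 1.2, and roughly 10% multiplicative noise. The names `decay`, `rate_lin`, and `rate_log` are hypothetical, and the quoted uncertainties are the usual one-sigma estimates taken from the covariance matrices that `curve_fit` and `np.polyfit` return.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic data (assumed for illustration): exponential decay with ~10% multiplicative noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 50)
true_amp, true_rate = 3.0, 1.2
y = true_amp * np.exp(-true_rate * x) * np.exp(rng.normal(0.0, 0.1, x.size))


def decay(x, amp, rate):
    """Two-parameter exponential decay model."""
    return amp * np.exp(-rate * x)


# Fit on the linear scale: the large early values dominate the residuals.
popt, pcov = curve_fit(decay, x, y, p0=(1.0, 1.0))
perr = np.sqrt(np.diag(pcov))  # one-sigma standard errors of (amp, rate)
rate_lin, rate_lin_err = popt[1], perr[1]

# Fit on the logarithmic scale: log(y) = log(amp) - rate * x is a straight line,
# so every point carries comparable weight when the noise is multiplicative.
coeffs, cov = np.polyfit(x, np.log(y), 1, cov=True)
rate_log, rate_log_err = -coeffs[0], np.sqrt(cov[0, 0])

print(f"linear-scale fit: rate = {rate_lin:.3f} +/- {rate_lin_err:.3f}")
print(f"log-scale fit:    rate = {rate_log:.3f} +/- {rate_log_err:.3f}")
```

The two fits answer slightly different questions because they weight the data differently, so the estimated rates and their uncertainties need not agree. Which scale is appropriate depends on how the noise enters the measurement, and that is a judgement the person doing the fit has to make.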