Tuesday, October 30, 2012

How does bias impact the physical sciences?

One of my favorite scientists, Ben Goldacre, recently posted a TED Talk that he gave last year on how biases affect the methodology of food and drug trials and the reporting of results. One solution he proposes is greater transparency in the reporting of science. Presumably, people are not willing to check all the facts because it is perceived as a tedious and unwelcome job.

It's worth watching and thinking about how similar biases affect the physical sciences. Some important biases to identify in our own work include:
  1. Publication bias - The increased likelihood of publishing positive results over negative results.
  2. Experimenter's bias - The bias to perform an experiment in such a way that one is more likely to achieve an expected result.
Most information I've found on scientific bias comes from medicine and the social sciences. Is this because they are more susceptible to bias, because the consequences of error are larger, or because the physical sciences have not adequately addressed these issues?
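A toy simulation can make publication bias concrete. The sketch below is only an illustration under made-up assumptions: the true effect is zero, every study has 20 noisy measurements, and a study is "published" only if its estimate sits at least two standard errors above zero. The function name, the publication rule, and all of the numbers are hypothetical and do not come from Goldacre's talk.

```python
import random
import statistics


def simulate_publication_bias(n_studies=2000, n_per_study=20,
                              true_effect=0.0, seed=0):
    """Simulate many small studies and 'publish' only clearly positive ones.

    Returns (all_estimates, published_estimates).
    """
    rng = random.Random(seed)
    all_estimates = []
    published = []
    for _ in range(n_studies):
        # Each study measures the same effect with unit-variance noise.
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        mean = statistics.fmean(sample)
        sem = statistics.stdev(sample) / n_per_study ** 0.5
        all_estimates.append(mean)
        # Hypothetical publication rule: only "significant" positive results.
        if mean / sem > 2.0:
            published.append(mean)
    return all_estimates, published


if __name__ == "__main__":
    all_estimates, published = simulate_publication_bias()
    print(f"mean of all studies:       {statistics.fmean(all_estimates):+.3f}")
    print(f"mean of published studies: {statistics.fmean(published):+.3f}")
    print(f"fraction published:        {len(published) / len(all_estimates):.1%}")
```

Even though no real effect exists in this toy setup, the published subset reports a positive one, because the selection rule discards everything else. Nothing here is specific to medicine; the same filter could act on any field's measurements.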

Monday, October 29, 2012

Open-Access Explained by PhD Comics - and more Data-centric science

My girlfriend sent me this video link from PhD Comics last week about open-access publishing. The video is narrated by Nick Shockey and Jonathan Eisen, brother of one of the co-founders of the Public Library of Science (commonly known to academics as PLoS).

One of the arguments presented by the two is that research should be free to re-use. I think this means that scientists should be able to use all of the knowledge and material presented in a journal paper to advance their own work. The narrators mention that the full content of papers should be searchable to easily find connections between works and to facilitate locating relevant papers.

This is another hint towards the shifting focus to a data-centric approach to science, only here the arguments for it are coming from the open-access movement.

I also want to add that I support this movement. I can't access a paper that I've written because our campus doesn't have access to the journal. And when I graduate, I will not have access to any of them unless I or my employer has the appropriate subscriptions. Ridiculous.

Thursday, October 25, 2012

Data-centric science - Correlation and causation

Yesterday I defined a model as a description that involves a cause-and-effect relationship between phenomena. In contrast, a data-centric approach to science looks only for correlations between data sets to answer scientific problems. This approach relies on very large data sets to come to accurate conclusions.

After thinking about yesterday's post I realized that my arguments relate to the common admonishment "correlation does not imply causation." This fallacy is most often committed when complex systems made of many interconnected parts are involved, such as in human health. Statements like "taking vitamin C tablets will cause me to not get sick" and "eating vegetables prevents me from getting cancer" are statements about cause-and-effect. As we have been taught again and again, though, taking vitamin C tablets may only decrease the chances that I get sick.
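The admonishment can be illustrated with synthetic data. In the sketch below, a hidden variable (hypothetically called "health") drives both vitamin C intake and the number of sick days, while vitamin C itself has no effect at all. All names, coefficients, and sample sizes are invented for illustration; this is not real health data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_confounded_data(n=5000, seed=1):
    """Generate data in which a hidden common cause links two variables."""
    rng = random.Random(seed)
    health, vitamin_c, sick_days = [], [], []
    for _ in range(n):
        h = rng.gauss(0.0, 1.0)                       # hidden common cause
        health.append(h)
        vitamin_c.append(h + rng.gauss(0.0, 1.0))     # healthier people take more
        sick_days.append(-h + rng.gauss(0.0, 1.0))    # healthier people are sick less
    return health, vitamin_c, sick_days


def correlation_within_band(health, xs, ys, half_width=0.1):
    """Correlation of xs and ys among samples whose hidden cause is near zero."""
    kept = [(x, y) for h, x, y in zip(health, xs, ys) if abs(h) < half_width]
    return pearson([x for x, _ in kept], [y for _, y in kept])


if __name__ == "__main__":
    health, vitamin_c, sick_days = simulate_confounded_data()
    print("correlation, all samples:       ", round(pearson(vitamin_c, sick_days), 2))
    print("correlation, similar health only:",
          round(correlation_within_band(health, vitamin_c, sick_days), 2))
```

Across all samples the two quantities are clearly anticorrelated, yet among samples with nearly the same hidden "health" value the correlation largely disappears. The correlation was real, but it says nothing about what the tablets do.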

So here is the dichotomy that I was looking for: model-based science is useful in simple systems for which I may make cause-and-effect statements. Data-centric science is more useful for complex, coupled systems for which causality is a poor descriptor.

This is certainly a new way of thinking. Depending on the complexity of what we are observing, we should either employ or abandon causality as a means of interpretation.

Wednesday, October 24, 2012

Data-centric science - What is a model?

Chris Anderson's The End of Theory: The Data Deluge Makes the Scientific Method Obsolete suggests that models may no longer be necessary for solving scientific problems due to the large amount of data now contained in databases across the globe. Rather, looking for correlations between events may be enough to solve these problems.

I'm going to assume that the article's title is an overstatement; not all scientific problems may be solved with a data-centric approach. Some are very well suited to this method, however. To distinguish between these types of problems, I think it's necessary to first address the question "what is a model?" After this is answered, I hope to address why models may sometimes be circumvented.

Wikipedia's disambiguation page for the word "model" is quite long. The word can mean many things within a scientific context. However, several words continuously appear on this page and its links: description, simulation, representation, framework. More informative (albeit complicated) is the explanation found at the Stanford Encyclopedia of Philosophy. The central question of this post is addressed on that site in Section 2: Ontology. A model may be a physical or fictional object, a description, an equation, or a number of other things.

Based on this information I think it's reasonable to state that a model is an attempt at replicating the behavior of some phenomenon, whether physically or as a result of an application of logical rules. I think further that a model establishes cause-and-effect relationships to do this. For example, Newton's theory of gravity contained the idea that something (gravity) caused the apple to fall. As another example, energy input from the ocean causes (among other things) hurricanes in weather models.

Models satisfy some human desire for causality. I read once (though I don't remember where) that people use reasoning as a coping mechanism for emotionally difficult situations, such as when a loved one dies. Somehow, finding a reason or a cause for things provides us some degree of comfort.

The data-centric approach to scientific problem solving obviates the establishment of a cause-and-effect relationship. Insurance companies don't need to know why married, twenty-something men get in fewer car wrecks than their single companions in order to charge them less. Instead, they only need to know whether this is true.

But other than to make ourselves feel good, why would we need to find a cause-and-effect relationship in the first place? I think that this could be because the ability to make correct predictions is an important part of any model. We make predictions when we're unable to carry out an experiment easily or when we don't have enough data already to answer a question. It is my suspicion that cause-and-effect relationships are central to a model's ability to predict, though I'm not sure how right now.

So, in summary, a model is a physical or mental construct meant to replicate the behavior of some phenomenon or system. I believe that the main difference between a model-based approach to science and a data-centric approach is that a model-based approach creates a causal chain of events that describes an observation. I don't necessarily see this chain ever ending. Once we determine a cause, we might wonder what caused the cause. And what caused the cause that caused the cause? At some point, data-centric science responds with "Enough! Just give me plenty of data and I will tell you if two events are correlated." That's all we can really hope for, anyway.

Notes: The never-ending chain of causes sounds very similar to Pirsig's never-ending chain of hypotheses in Zen and the Art of Motorcycle Maintenance. Is there a connection?

Also, I remember E. T. Jaynes arguing in Probability Theory: The Logic of Science that we can't really know an event will occur with 100% probability. This seems to suggest that cause-and-effect relationships do not really exist. Otherwise, we would always know the outcome of some cause. And if they don't really exist but are actually good approximations, then models really are what we've been told since middle-school science: imperfect and intrinsically human attempts at describing the world.

Note, October 25, 2012: I wrote this post late last night after having had a beer with dinner, so my mind wasn't as clear as when I normally write these posts. I realized this morning that the reason for building cause-and-effect relationships is that we can control a phenomenon if we know its proper cause. Many things are correlated, but fewer are linked by a causal relationship. Therefore, accurate models provide us the ability to control the outcome of an experiment, not just predict it. I don't believe that correlative analytics necessarily allow us to do this.
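The difference between predicting and controlling can be sketched with a toy system. In the hypothetical setup below, a variable called `cause` drives both a `proxy` and an `outcome`; the proxy is strongly associated with the outcome but has no influence on it. The variable names, coefficients, and the `set_cause` / `set_proxy` arguments (which stand in for an experimenter forcing a variable to a chosen value) are all assumptions made for illustration.

```python
import random


def run_system(n, rng, set_cause=None, set_proxy=None):
    """Draw n (proxy, outcome) samples from a hypothetical toy system.

    Passing set_cause or set_proxy overrides that variable, which mimics
    an experimenter intervening on it instead of merely observing it.
    """
    samples = []
    for _ in range(n):
        cause = rng.gauss(0.0, 1.0) if set_cause is None else set_cause
        proxy = (cause + rng.gauss(0.0, 0.5)) if set_proxy is None else set_proxy
        # The outcome depends only on the cause, never on the proxy.
        outcome = 2.0 * cause + rng.gauss(0.0, 0.5)
        samples.append((proxy, outcome))
    return samples


def mean_outcome(samples):
    """Average outcome over a list of (proxy, outcome) samples."""
    return sum(outcome for _, outcome in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(2)
    observed = run_system(20000, rng)
    high_proxy = [s for s in observed if s[0] > 1.0]
    print("observed, all samples:      ", round(mean_outcome(observed), 2))
    print("observed, proxy > 1 only:   ", round(mean_outcome(high_proxy), 2))
    print("after forcing proxy to 3.0: ",
          round(mean_outcome(run_system(20000, rng, set_proxy=3.0)), 2))
    print("after forcing cause to 3.0: ",
          round(mean_outcome(run_system(20000, rng, set_cause=3.0)), 2))
```

Observing a high proxy is a good predictor of a high outcome, but forcing the proxy high leaves the outcome where it was. Only forcing the cause moves it, which is the sense in which a correct causal model offers control and a correlation alone does not.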

Tuesday, October 23, 2012

Data-centric science - My initial thoughts

"The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years... But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete."

This quote is from a 2008 article in Wired Magazine called "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" by Chris Anderson. In this article, Anderson addresses our ability to solve scientific problems by looking for correlations in data without the need to form models. This ability has been enabled by the huge amount of searchable data that the internet has generated over the past two decades, which has led us into the so-called Petabyte Age.

This approach, sometimes referred to as analytics, has been successfully employed to translate between written languages, sequence genomes, match advertising outlets to customers, and provide better healthcare to people. Now, Anderson argues, it may be applied to problems across the full range of sciences. This is a welcome evolution, partly because many fields now possess too many theories and lack the experiments to validate or refute their predictions. Take particle physics or molecular biology, for example. There are arguably more theories and models now about these systems than ever before, and many of them cannot be verified. A data-centric approach could solve this problem.

This is a very interesting idea and I've been thinking about it for a few weeks now. I think that, to make any sense of it, I need to address several issues and assumptions. Questions to consider include:
  • What is a model? When is it useful and when is it not?
  • Are only certain fields of science able to benefit from a data-centric approach?
  • What is the human component to research? How would it change if this approach were implemented?
  • What has already been done to solve scientific problems with data-driven solutions?
  • What are the philosophical implications of changing our idea of science? The scientific method has existed in some form or another for more than 2000 years (I'm referring all the way back to Aristotle, even if his ideas contained flaws). A significant change to the scientific method, especially given its importance to modern society, could have major sociological consequences.
I'll consider these questions in future posts.

Thursday, October 18, 2012

Come see my talk at FiO

I'm giving a talk on optically controlled active media today at Frontiers in Optics. The talk number is FTh3D.7 and it's in Highland E.

The talk is about our work concerning the optical forces on colloidal particles in 3D, space- and time-dependent speckle. I'm pushing it as a model system for testing ideas from nonequilibrium thermodynamics, but I think that our method of treating the field-particle coupling is equally interesting.

Come check it out.

Monday, October 15, 2012

Tomorrow at Frontiers in Optics

Right now I'm in Rochester, New York for this year's Frontiers in Optics conference hosted by the OSA. So far I've attended the plenary talks, which dealt with quantum optics and thermodynamics, 2D IR spectroscopy, retinal imaging, and, of course, the Higgs boson. In addition, I visited the Omega laser facility, which was incredibly fascinating. If you're in the area and you have any interest in incredibly powerful lasers or inertial confinement nuclear fusion, then I recommend making a visit.

Tomorrow I plan to attend some talks and work at the CREOL exhibition booth from 1:00 PM to 3:00 PM. Additionally, on Wednesday morning at 11:30 AM I'm giving a talk for a group mate who couldn't make it to the conference, and on Thursday at 3:00 PM I'm giving my own talk on optically controlled active media. Briefly, my talk deals with solutions pumped by light to increase their free energy so that they may carry out additional work. I hope the project will eventually be applied to controlling reaction kinetics in cells.

If you're there, let me know and we can talk over coffee or a beer!

Friday, October 12, 2012

Better charts and graphics

I was recently asked to supply some annotated figures to a journal in a vector format, rather than the bitmap format that I had submitted them in. Unfortunately, I did not originally save the graphs in this format and had to hurriedly redo them with a program I was unfamiliar with. As you might expect, this wasn't a pleasant experience.

This event has led me to offer the following tip: when you're making charts, figures, and other graphics that you intend to publish, be sure you save them in a production-quality format at the step where they are generated. It might mean a bit more time as you make them, but it saves a lot of hassle in the long run.

It's also worthwhile to explore the tools that are available to you at your institution for making figures. I am most comfortable with visualizing and analyzing data in Matlab, but I feel like too much effort is needed to make the plots look nice. Origin is a common alternative, but I find that it is not so intuitive to use and that there is a paucity of tutorials available. I also sometimes use free packages like SciPy along with Inkscape and GIMP to prepare for when I may find myself in a situation where I don't have access to (expensive) commercial options. Ultimately, it seems like the best approach is to use many different visualization tools depending on your purpose for the plot.
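With the free Python tools, one way to follow the tip above is to write every figure in vector and bitmap formats at the moment it is generated. The sketch below assumes Matplotlib as the plotting library (the post mentions SciPy but does not name a plotting package); the helper names `save_figure` and `make_example_figure`, the file names, and the plotted curve are all hypothetical.

```python
import math
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt


def save_figure(fig, basename, formats=("svg", "pdf", "png")):
    """Save one figure in several formats and return the paths written."""
    paths = []
    for ext in formats:
        path = f"{basename}.{ext}"
        # dpi only matters for the bitmap (png) copy; svg and pdf stay vector.
        fig.savefig(path, dpi=300, bbox_inches="tight")
        paths.append(path)
    return paths


def make_example_figure():
    """Build a small placeholder plot (a made-up damped oscillation)."""
    xs = [0.1 * i for i in range(101)]
    ys = [math.exp(-0.3 * x) * math.cos(2.0 * x) for x in xs]
    fig, ax = plt.subplots(figsize=(3.4, 2.6))
    ax.plot(xs, ys, label="damped oscillation")
    ax.set_xlabel("time (s)")
    ax.set_ylabel("amplitude (a.u.)")
    ax.legend(frameon=False)
    return fig


if __name__ == "__main__":
    figure = make_example_figure()
    for written in save_figure(figure, os.path.join(os.getcwd(), "example_figure")):
        print("wrote", written)
```

The SVG copy can then be opened in Inkscape for annotation, while the PNG serves as a quick preview. Saving all of them in one call costs almost nothing when the plot is made and avoids having to regenerate it later.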

Here is an old but useful post on making production-quality graphs in Matlab. Has this process become simpler?

Addendum: I just found this on the Matlab File Exchange: plot2svg.m. I haven't yet tried it, but it should convert your Matlab plots to the .svg format, which is readable by Inkscape and a W3C standard.