One of the perpetual challenges in my career as a modeler of biochemical systems has been the need to balance accuracy with reliability. This paradox is not as strange as it seems. Typically when you build a model you include a lot of approximations meant to make the modeling process easier; ideally you want a model to be as simple as possible and contain as few parameters as possible. But this strategy does not work all the time, since sometimes it turns out that in your drive for simplicity you have left a crucial factor out. So now you include this crucial factor, only to find that the uncertainties in your model go through the roof. What's happening in such unfortunate cases is that along with including the signal from the previously excluded factor, you have also inevitably included a large amount of noise. This noise typically results from an incomplete knowledge of the factor, either from calculation or from measurement. Modelers of every stripe thus have to strike a fine balance between including as much of reality as possible and making the model accurate enough for quantitative explanation and prediction.

It seems that this is exactly the problem that has started bedeviling climate change models. A recent issue of Nature had a very interesting article on what seems to be a wholly paradoxical feature of models used in climate science; as the models are becoming increasingly realistic, they are also becoming less accurate and predictive because of growing uncertainties. I can only imagine this to be an excruciatingly painful fact for climate modelers who seem to be facing the equivalent of the Heisenberg uncertainty principle for their field. It's an especially worrisome time to deal with such issues since the modelers need to include their predictions in the next IPCC report on climate change which is due to be published this year.

A closer look at the models reveals that this behavior is not as paradoxical as it sounds, although it's still not clear how you would get around it. The article especially struck a chord with me since, as I mentioned earlier, similar problems often plague models used in chemical and biological research. In the case of climate change, the fact is that earlier models were crude and did not account for many fine-grained factors that are now being included (such as the rate at which ice falls through clouds). In principle and even in practice, there are a bewildering number of such factors (partly exemplified by the picture on top). Fortuitously, the crudeness of the models also prevented the uncertainties associated with these factors from being included in the modeling. The uncertainty remained hidden. Now that more real-world factors are being included, the uncertainties endemic in these factors reveal themselves and get tacked on to the models. You thus face an ironic tradeoff; as your models strive to mirror the real world better, they also become more uncertain. It's like struggling in quicksand; the harder you try to get out of it, the deeper you get sucked in.

This dilemma is not unheard of in the world of computational chemistry and biology. A lot of the models we currently use for predicting protein-drug interactions, for instance, are remarkably simple and yet accurate enough to be useful. Several reasons account for this unexpected accuracy; among them are cancellation of errors (the Fermi principle), similarities of training sets to test sets and sometimes just plain luck. The similarity of training and test sets especially means that your models can be pretty good at explanation but can break down when it comes to prediction of even slightly dissimilar systems. In addition, error analysis is unfortunately not a priority in most of these studies, since the whole point is to publish correct results. Unless this culture changes, our road to accurate prediction will be painfully slow.

Here's an example from my own field of how "more can be worse". For the last few months I have been using a very simple model to try to predict the diffusion of druglike molecules through cell membranes. This is an important problem in drug development since even your most stellar test-tube candidate will be worthless until it makes its way into cells. Cell membranes are hydrophobic (water-hating) while the water surrounding them is hydrophilic (water-loving). The ease with which a potential drug transfers from the surrounding water into the membrane depends among other factors on its solvation energy, that is on how readily the drug can shed water molecules; the smaller the solvation energy, the easier it is for drugs to get across. This simple model which calculates the solvation energy seems to do unusually well in predicting the diffusion of drugs across real cell membranes, a process that's much more complex than just solvation-desolvation.

One of the fundamental assumptions in the model I am using is that the molecule exists in just one conformation in both water and the membrane. A conformation of a molecule is like a yoga position for a human being; typical organic molecules with many rotatable bonds usually have thousands of possible conformations. The assumption of a single conformation is fundamentally false since in reality molecules are highly flexible creatures that interconvert between several conformations both in water and inside a cell membrane. To overcome this assumption, a recent paper explicitly calculated the conformations of the molecule in water and included this factor in the diffusion predictions. This was certainly more realistic. To their surprise, the authors found that making the calculation more realistic made the predictions worse. While the exact mix of factors responsible for this failure can be complicated to tease apart, what's likely happening is that the more realistic factors also bring more noise and uncertainty with them. This uncertainty piles up, errors that were likely canceling before no longer cancel, and the whole prediction becomes fuzzier and less useful.
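To make the idea of lost error cancellation concrete, here is a toy sketch in Python. It is not the calculation from the paper; the energies, the size of the bias and the names `transfer_energy` and `cancellation_demo` are made-up assumptions chosen only to show the arithmetic. It imagines a crude method that makes the same systematic error in water and in the membrane, and then a "more realistic" version that fixes only the water side.

```python
# Toy illustration of error cancellation. All numbers are hypothetical.

def transfer_energy(e_water, e_membrane):
    """Energy cost of moving a molecule from water into the membrane."""
    return e_membrane - e_water


def cancellation_demo(true_water=-10.0, true_membrane=-6.0, bias=1.5):
    """Compare a uniformly crude model with a half-refined one.

    `bias` is an assumed systematic error that the crude method makes
    in BOTH environments (for example, by ignoring flexibility).
    """
    exact = transfer_energy(true_water, true_membrane)
    # Crude model: the same error on both sides cancels in the difference.
    crude = transfer_energy(true_water + bias, true_membrane + bias)
    # Half-refined model: the water side is corrected, the membrane side
    # is not, so the membrane error is now fully exposed.
    half_refined = transfer_energy(true_water, true_membrane + bias)
    return {"exact": exact, "crude": crude, "half_refined": half_refined}


print(cancellation_demo())
```

In this contrived setup the crude model lands on the exact answer for the wrong reasons, while the partly improved model is off by the full bias. Real cases are messier, but the sketch shows one way that fixing half of a calculation can make the total worse.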

I believe that this is what is partly happening in climate models. Including more real-life factors in the models does not mean that all those factors are well understood or tightly measured. You are inevitably introducing some known unknowns. Ill-understood factors will introduce more uncertainty. Well-understood factors will introduce less uncertainty. Ultimately the accuracy of the models will depend on the interplay between these two kinds of factors, and currently it seems that the rate of inclusion of new factors is higher than the rate at which those factors can be accurately calculated or measured.
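The interplay between well-understood and ill-understood factors can be sketched with a small simulation. This is only an illustration under stated assumptions: "reality" is taken to be a sum of a dominant factor and a minor one, the coefficients (1.0 and 0.3) are invented, and `compare_models` is a hypothetical helper, not anything taken from the climate or drug models discussed here. The crude model ignores the minor factor; the "realistic" model includes it, but only through a noisy measurement.

```python
import numpy as np


def compare_models(n=5000, noise_sd=2.0, seed=0):
    """Return (rmse_crude, rmse_realistic) for a toy two-factor system."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)  # dominant factor, assumed perfectly known
    x2 = rng.normal(size=n)  # minor factor, true (unobservable) value
    y = 1.0 * x1 + 0.3 * x2  # assumed "reality"

    # We only ever see the minor factor through a noisy measurement.
    x2_measured = x2 + rng.normal(scale=noise_sd, size=n)

    crude = 1.0 * x1                           # leaves the minor factor out
    realistic = 1.0 * x1 + 0.3 * x2_measured   # includes it, noise and all

    def rmse(pred):
        return float(np.sqrt(np.mean((pred - y) ** 2)))

    return rmse(crude), rmse(realistic)


# Poorly measured factor: including it hurts.
print(compare_models(noise_sd=2.0))
# Well measured factor: including it helps.
print(compare_models(noise_sd=0.5))
```

In this toy setting the error of the crude model comes from the missing factor itself, while the error of the realistic model comes from the measurement noise on that factor. Whichever is larger decides which model predicts better, which is why adding a real but poorly measured factor can be a net loss.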

The article goes on to note that in spite of this growing uncertainty the basic predictions of climate models are broadly consistent. However it also acknowledges the difficulty in explaining the growing uncertainty to a public which has become more skeptical of climate change since 2007 (when the last IPCC report was published). As a chemical modeler I can sympathize with the climate modelers.

But the lesson to take away from this dilemma is that crude models sometimes work better than more realistic ones. My favorite quote about models comes from the statistician George Box who said that "all models are wrong, but some are useful". It is a worthy endeavor to try to make models more realistic, but it is even more important to make them useful.

Note: As a passing thought it's worth pointing out some of the common problems that can severely limit the usefulness of any kind of model, whether it's one used for predicting the stock market, the global climate or the behavior of drugs, proteins and genes:

1. Overfitting: You fit the existing data so well that your model becomes a victim of its own success. It's stellar at explaining what's known but it's so overly dependent on every single data point that a slightly different distribution of data completely overwhelms its predictive power.

2. Outliers: On the other hand if you fit only a few data points and ignore the outliers, your model again runs the risk of failing when facing a dataset "enriched" in the outliers.

3. Generality vs specificity: If you build a model that predicts average behavior, it may turn out to have little use in predicting what happens under specific circumstances. Call this the bane of statistics itself if you will, but it certainly makes prediction harder.

4. Approximations: This is probably the one limitation inherent to every model since every model is based on approximations without which it would simply be too complex to be of any use. The trick is to know which approximations to employ and which ones to leave out, and to run enough tests to make sure that the model still explains most of the data despite what has been left out. Approximations are also often dictated by expediency since even if a model can include every single parameter in theory, it may turn out to be prohibitively expensive in terms of computer time or cost. There are many good reasons to approximate, as long as you always remember that you have done so.
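The first item on this list, overfitting, is easy to demonstrate with a few lines of Python. The sketch below is a generic textbook-style illustration, not a model from any of the fields mentioned above: the straight-line "truth", the noise level and the helper names `rmse` and `overfitting_demo` are all assumptions. It fits ten noisy points with a simple line and with a degree-9 polynomial, then scores both on data drawn from a slightly wider range.

```python
import numpy as np
from numpy.polynomial import Polynomial


def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def overfitting_demo(seed=0, noise_sd=0.2):
    """Return {degree: (train_rmse, test_rmse)} for degrees 1 and 9."""
    rng = np.random.default_rng(seed)

    def truth(x):
        return 2.0 * x + 1.0  # assumed underlying relationship

    # Ten noisy training points on [0, 1].
    x_train = np.linspace(0.0, 1.0, 10)
    y_train = truth(x_train) + rng.normal(scale=noise_sd, size=x_train.size)

    # Test data from a slightly different range, [0, 1.2].
    x_test = np.linspace(0.0, 1.2, 200)
    y_test = truth(x_test) + rng.normal(scale=noise_sd, size=x_test.size)

    results = {}
    for degree in (1, 9):
        model = Polynomial.fit(x_train, y_train, degree)
        results[degree] = (
            rmse(model(x_train), y_train),
            rmse(model(x_test), y_test),
        )
    return results


for degree, (train_err, test_err) in overfitting_demo().items():
    print(f"degree {degree}: train RMSE {train_err:.3g}, test RMSE {test_err:.3g}")
```

The degree-9 polynomial passes through every training point, so its training error is essentially zero, yet it does much worse than the straight line on the new data. It has memorized the noise rather than learned the trend.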

This is an updated and revised version of a post on The Curious Wavefunction blog.