January 2, 2013
This is an opinion piece I wrote for a new planned journal on open computational research that for one reason or another failed to take off. Hopefully the journal will be resurrected or another will take its place since this is an important topic. Simulation and modeling have now become robust and frequent paradigms in almost every scientific endeavor and their continued relevance and use will depend on holding computational results to the same standards that are routinely applied to theoretical and experimental work. My piece tries to ask how computational research can be communicated most effectively and honestly in an age of open scientific publishing. Feedback will be most welcome since this is necessarily going to be a community endeavor.
New ways of doing science demand new standards
In the last three decades or so, along with theory and experiment, modeling and simulation have become enshrined as the third leg in the methodology of science. Although not as hallowed in its history as theory and experiment, modeling is now widely used to complement – and occasionally supplement – the results of theoretical and experimental investigations. As science has become both more complex and more multidisciplinary, it has been essential to resort to models to explain, understand and predict. Many of the most pressing and exciting problems in modern science, from understanding the brain to simulating biological networks to mapping the large-scale structure of the cosmos, involve complex, multifactorial phenomena that are not amenable to first-principles solutions. Analyzing such problems entails a judicious mix of rigorous theorizing, statistical analysis and empirical guesswork, so it is inevitable that model building will play an increasingly important role in deciphering the workings of these real-world physical, biological and engineered systems.
As with any philosophy of doing science, modeling has to conform to the time-honored principles and constraints that have contributed to modern science’s enormous growth and practical utility over the last five hundred years or so. Foremost among these constraints is the accurate and reproducible communication of results which lies at the foundation of scientific inquiry. Accuracy is essential for understanding the mechanistic workings of a particular method and its causal connections to the observed result. Reproducibility is at the heart of the scientific method, a basic requirement without which it becomes impossible to trust and validate any scientific study. As modeling and simulation have become mainstays of modern scientific research, they have brought with them unique challenges pertaining to both these methodological aspects.
Galileo provided an early and striking instance of reproducibility in modern science when he constructed his telescope and made groundbreaking observations of the moons of Jupiter. As the story goes, he painstakingly made several copies of the instrument and gave them to the crowned heads of Europe so that they too could see what he saw. In Galileo’s work we witness two cardinal aspects of reproducibility. One is the duplication of the instrument of research; the other is the duplication of the results themselves, in this case in the form of observations.
Both these metaphors lend themselves well to the demands for accuracy and reproducibility in simulation techniques. The instrumental analog of modeling would be the precise form of the hardware and software used for the study. The observational analog would be the specific method of gathering, processing and presenting the data so that it appears the same to every observer. If we are to accurately reproduce and verify the results of modeling, we need to have standards for addressing both these aspects of data generation, analysis and presentation. Unfortunately, premier journals have yet to institute standard policies for code submission. Science, for instance, requires its authors to supply the actual code, while Nature requires only “a description detailed enough to allow others to write their own code to do similar analysis”, which can be a recipe for ambiguity. Ironically, as graphical user interfaces and ease of operation have spread the use of simulation software, it has become all too easy for non-expert users to treat the software as a black box and not worry about communicating under-the-hood details. The increasing availability of open-source programs and the undoubtedly welcome adoption of simulation by experimentalists make the matter of having standards and venues for the communication of computational results a particularly urgent one.
Even subtle changes in simulation protocols need to be tracked
First and foremost in this endeavor is a requirement for stating as many details of the hardware and software as possible. An anecdote from my own field of computational chemistry illustrates the reasons for doing this. A past research advisor of mine wanted to replicate a modeling study done with a particularly interesting molecule, so he contacted the scientist who had originally performed the study and processed the system according to that scientist’s protocol. He appropriately adjusted the parameters and ran the experiment. To his surprise he got a very different result. He repeated the protocol several times but consistently saw the same discrepancy. Finally he called up the original researcher. The two went over the protocol a few times and finally realized that the problem lay in a minor but overlooked detail: the two scientists were using slightly different versions of the modeling software. This wasn’t even a new version, just an update, but for some reason it was enough to significantly change the results.
The anecdote clearly demonstrates the need to spell out every aspect of the software as well as its implementation. This includes versions, updates, patches, operating system, input parameters, ‘pre-processing’ steps and, most crucially, expert tweaking. This last aspect deserves some elaboration. Modeling can be as much of an art as a science. Expert modelers rarely use software in a default, out-of-the-box incarnation. Instead virtually every scientist doing computation tweaks and massages the program to varying extents in order to customize it for the specific system under consideration. Much of this tweaking is based on expert knowledge, experience and intuition and can consist of multiple fixes, including the incorporation of experimental data, the transformation of the algorithm into one used for a similar system in the past, additional parameterization of parts of the algorithm to make it better suited to the system under consideration and a hodgepodge of other non-standard and often counterintuitive modifications. At least a few of these are inspired by guesswork and intuition, qualities which have always been important players in scientific success. But most tweaks of this kind are never going to be apparent in the original code, not even in its comments.
How then could we incorporate these essential but rather intangible parts of the scientific process into the communication of research results? A partial solution would be to make log files available along with the original code and software specifications. Some commercial programs already have such facilities built into their user interfaces. The drug discovery modeling suite from Schrödinger for instance has, in addition to its usual process log files, a command script editor which keeps track of all the manipulations and minor tweaks that the user performs in the environment of the global user interface. Since multiple software applications can be accessed from this interface, the process valuably keeps track of manipulations both within and across different modules. The command script can in turn be run by anyone with access to the software, and if included with a list of inputs, it should faithfully reproduce the original protocol. Having researchers upload such scripts along with the original code will allow a far more accurate duplication of the protocol than instructions written in plain English would. In a more general sense, there could also be ‘meta-scripts’ which keep track of all software and hardware details and user manipulations. Such scripts would ideally encapsulate a comprehensive recipe for the entirety of the simulation protocol. In the absence of such global log files, insisting on application-specific log files would be a good start.
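To make the idea of a ‘meta-script’ a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `RunLog` class, the software name and version, and the step names are all hypothetical and do not correspond to any real package's logging facility. The point is simply that environment details and every manual manipulation end up in one machine-readable record.

```python
import json
import platform
import sys
from datetime import datetime, timezone


class RunLog:
    """Hypothetical meta-script: records the environment and each user step."""

    def __init__(self, software_name, software_version):
        self.record = {
            "software": {"name": software_name, "version": software_version},
            "python": sys.version.split()[0],
            "os": platform.platform(),
            "machine": platform.machine(),
            "started_utc": datetime.now(timezone.utc).isoformat(),
            "steps": [],
        }

    def step(self, action, **parameters):
        """Append one manipulation, in the order it was performed."""
        self.record["steps"].append({"action": action, "parameters": parameters})

    def to_json(self):
        """Serialize the whole record so it can be uploaded with the paper."""
        return json.dumps(self.record, indent=2, sort_keys=True)


# Illustrative values only; none of these names refer to a real product.
log = RunLog("example-modeling-suite", "1.2.3-update4")
log.step("load_structure", path="ligand.mol2")
log.step("minimize", force_field="hypothetical-ff", max_iterations=500)
log.step("manual_tweak", note="protonation state of one residue set by hand")

manifest = log.to_json()
print(manifest)
```

Note that the expert tweak is logged as a free-text note alongside the scripted steps; a record like this cannot capture intuition, but it at least flags where intuition intervened.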
Just like software, hardware should also be explicitly described. This is especially true these days when many computations are run on parallel, distributed and cloud-based systems which may well present a varying mix of hardware configurations. An accurate representation of hardware should state details about processors, memory, graphics cards, pre-installed operating systems and monitors and should ideally also include vendor names. Naturally, the amount of detail would depend on the nature of a particular study. For instance, a GPU-intensive study would have to list as many details of the graphics processing card as possible, while a more general study may not be constrained by this requirement. Ultimately, although it would not be possible for every reader to duplicate all hardware details, even knowing that one is using a different configuration alerts one to possible differences in output. What would truly be useful, however, would be to categorize these differences as a possible guide to future simulations, and this will not be possible until controlled experiments are run in which the results from two simulations differing only in their hardware are compared. For a journal specializing in open computational research, there could be a separate section (perhaps online) where users can document interesting differences in results stemming from different hardware architectures.
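As a rough sketch of what such a hardware description and comparison might look like, the Python snippet below collects the few details the standard library exposes and reports where a reader's machine differs from a published configuration. The function names and the `published` dictionary are assumptions made for illustration; memory and GPU details are not available from the standard library and would need vendor tools or third-party packages.

```python
import os
import platform


def describe_hardware():
    """Collect the hardware details the Python standard library can see."""
    return {
        "architecture": platform.machine(),
        "processor": platform.processor(),
        "logical_cpus": os.cpu_count(),
        "os": platform.system(),
        "os_release": platform.release(),
    }


def diff_configs(reported, local):
    """Return {key: (reported_value, local_value)} for every differing key."""
    keys = sorted(set(reported) | set(local))
    return {
        key: (reported.get(key), local.get(key))
        for key in keys
        if reported.get(key) != local.get(key)
    }


# Hypothetical configuration as it might be reported alongside a paper.
published = {
    "architecture": "x86_64",
    "logical_cpus": 64,
    "gpu": "vendor-and-model-placeholder",
}

for key, (theirs, ours) in diff_configs(published, describe_hardware()).items():
    print(f"{key}: published={theirs!r} local={ours!r}")
```

A difference listed here does not mean the results will differ, only that the reader has been alerted to a place where they might.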
A particularly valuable piece of information for duplication of computational studies would be the provision of positive and negative controls. Such controls have been a longstanding requirement for publication, especially in the biological community and more generally in experimental science, but the computational community doesn’t seem to have adopted them on a large scale. In the case of computational studies, running the algorithm in question on a test system would provide confidence in the integrity of the protocol. The test system could be a well-studied one for which an accurate answer is known, for instance by way of experiment, and has been replicated several times. Any new algorithm that fails to provide this answer would be suspect. Then, even if the real system described in a paper cannot be exactly duplicated by another worker, either because of its complexity or due to lack of resources, he or she could have confidence in the workings of the procedure by duplicating its success on the robust test system which would serve as a control. A test system serving as a negative control, one for which the protocol is known to yield no effect, would serve a similar purpose.
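The logic of controls can be sketched in a few lines of Python. Everything here is a toy assumption: `protocol` stands in for whatever algorithm a study actually uses (a plain Pearson correlation in this case), and the two test systems are chosen only because their correct answers are known exactly.

```python
import math


def protocol(xs, ys):
    """Stand-in for the algorithm under study: Pearson correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def run_controls(tolerance=1e-9):
    """Run the protocol on two test systems whose answers are known."""
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
    # Positive control: a perfectly linear relationship, known answer 1.
    positive = protocol(xs, [2 * x + 1 for x in xs])
    # Negative control: symmetric data with no linear relationship, known answer 0.
    negative = protocol(xs, [x * x for x in xs])
    return {
        "positive_control_passed": abs(positive - 1.0) < tolerance,
        "negative_control_passed": abs(negative) < tolerance,
    }


print(run_controls())
```

A reader who cannot afford to rerun the full study can still run the controls; if either one fails on their installation, that is a warning sign before any comparison of the real results begins.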
Good models should be explicitly tagged and showcased.
Cataloging all this information is a challenge in itself, and a possible way to do it would be to build a central repository of code that contains a variety of different protocols along with their attendant parameters. As a recent article in Science suggests, this repository could be called ‘CodeMedCentral’. Just as PubMedCentral provides the details of scientific publications, CodeMedCentral could provide supporting software and hardware details for every published study. Carefully annotating the website and categorizing information on it would be important for accurate reproduction of models. The Science article also suggests tagging articles with a qualifier that indicates the degree of confidence in the study based on whether it was successfully reproduced or not. Studies which have been successfully duplicated could be tagged with an ‘R’ qualifier. This policy has already been fruitfully adopted by the journal Biostatistics, which suggests that the possible stigma of an untagged article is not dissuading eager scientists from publishing in the journal, preferably with an ‘R’ stamp. At the very least, such categorizing of articles according to the extent of code submission and validation will allow researchers to realistically judge the integrity of the study, especially in comparison studies.
We would be remiss in discussing standards for code distribution and reproduction without addressing the problem of proprietary code. Depending on the context, it may or may not be possible to divulge all details of the input and the precise methods of analysis. This is especially the case in industries like the pharmaceutical and aerospace industries. While the constraints on proprietary code inevitably make accurate data duplication difficult, even proprietary data can be amenable to partial reproducibility. In a cheminformatics study for instance, molecular structures which are proprietary could be encoded into special organization-specific formats that are hard to decode. One could still run a set of modeling protocols on this cryptic data set and generate statistics without revealing the identity of the structures. The duplication of the statistics would provide confidence in the integrity of the study. Naturally there will have to be safeguards against the misuse of any such evaluation but they may not be very hard to implement.
Presentation, presentation, presentation.
As important as the accurate description of software and hardware is, it will fail without a uniform standard for the analysis and presentation of data. This is a problem that typically plagues every field at its conception, when different researchers choose to present the results of their study using their favorite statistical technique, data processing software and graphical depiction. Such a presentation may hide as much detail as it reveals since it often relegates outliers, failed results and inherent biases to the side. Not only does this wide variability in analysis and reporting mislead but it also crucially impedes the comparison and meta-analysis of different studies, making it very difficult to vindicate successful techniques and discard unsuccessful ones. What is paramount is the establishment and reporting of standard benchmarks and metrics. A cogent example is provided from the field of virtual screening wherein algorithms are tested for their ability to identify potentially valuable new drug candidates. The confusion arising from the lack of common protocols for setting up test systems and running the relevant algorithms on them was compounded by the dissimilar metrics used for assessment, some of which artificially inflated the significance of the results. It is only in the last few years that we have seen the proposal and use of careful benchmarks for both setting up and assessing virtual screening studies using unbiased statistical metrics.
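To illustrate why the choice of metric matters, the sketch below computes two measures commonly used to assess virtual screening: the area under the ROC curve and an enrichment factor. The text above does not say which metrics were at issue, so treat these two as my assumption of representative examples; the scores and labels are invented toy data.

```python
def roc_auc(scores, labels):
    """Probability that a random active outscores a random decoy.

    Ties between an active and a decoy count as half a win.
    """
    actives = [s for s, label in zip(scores, labels) if label == 1]
    decoys = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction):
    """Hit rate in the top-scoring fraction divided by the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (sum(labels) / len(labels))


# Toy data: 1 marks a known active, 0 a decoy. Values are illustrative only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print("ROC AUC:", roc_auc(scores, labels))
print("Enrichment factor at 20%:", enrichment_factor(scores, labels, 0.2))
```

The enrichment factor depends on the chosen cutoff and on the ratio of actives to decoys in the test set, so two studies reporting it under different conditions are not directly comparable, whereas the ROC AUC does not depend on either. Which metric to standardize on is a decision for the community, not something this sketch settles.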
It goes without saying that an accurate description of the system under consideration and the methods used for analysis, including a proper presentation of statistical software, methods and error bars, is key to faithful reproduction of the study. In case of statistics-heavy modeling, the results should ideally also include an estimation of model uncertainty, expressed through figures establishing the numerical accuracy of estimates, sensitivity of the model to boundary conditions, confidence intervals etc. The establishment of these standard metrics for reporting studies cannot be overemphasized; without them it would be inherently impossible to separate the true scientific wheat from the chaff of illusory success.
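One simple way to attach a confidence interval to a reported number is the percentile bootstrap, sketched below in Python. This is my choice of technique for illustration, not one prescribed above, and the `errors` values are made-up toy data. The fixed random seed is deliberate: the uncertainty estimate itself should be reproducible.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval can be regenerated
    resampled_means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return resampled_means[lower_index], resampled_means[upper_index]


# Toy prediction errors from a hypothetical model; not real results.
errors = [0.71, 0.65, 0.80, 0.58, 0.77, 0.69, 0.73, 0.62, 0.84, 0.66]

low, high = bootstrap_ci(errors)
print(f"mean error = {statistics.fmean(errors):.3f}")
print(f"95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

Reporting the interval together with the number of resamples and the seed lets a reader regenerate exactly the same figures.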
Ultimately the need for making computational details available is no different from the need for making any other theoretical or experimental details available. Modeling and simulation are undoubtedly going to play an increasingly important role in scientific research. Gaining an accurate idea of their pitfalls and promises will be paramount for accurately assessing their place in the pantheon of future scientific inquiry. Reproducibility will illuminate the path.
Acknowledgements: The author wishes to thank Nathan Walsh for helpful discussion.
 Winsberg, E. “Science in the Age of Computer Simulation”, 2010, University of Chicago Press
 Heilbron, J. L. “Galileo”, 2010, Oxford University Press
 Hanson, B.; Sugden, A.; Alberts, B. Science, 2011, 331, 649
 Nature, 2011, 470, 305
 Glass, D. J. “Experimental Design for Biologists”, 2006, Cold Spring Harbor Laboratory Press
 Peng, R. Science, 2011, 334, 1226
 Shoichet, B. K. Nature, 2004, 432, 862
 Hawkins, P. C. D.; Warren, G.; Skillman, A. G.; Nicholls, A. J. Comput. Aided Mol. Des. 2008, 22, 179
 Irwin, J. J. J. Comput. Aided Mol. Des. 2008, 22, 193