Darwin's sketch of an evolutionary tree under the heading "I think" is a powerful and enduring image of his theory of evolution by natural selection. Phylogenetic trees--branching diagrams that show the relationships between organisms and their evolution from a common ancestor--are now a standard image in biology texts, used to situate an organism in biological space and time. I often make phylogenetic trees in my research, comparing DNA sequences from different bacterial strains to better understand the relationships between species. Like most biologists, I'm not quite a power user or a taxonomist, so I usually interact with the different methods of sequence comparison and tree building as choices in a drop-down menu, comparing the different trees using the statistical measures offered by the programs. Without fossil specimens for every evolutionary transition, trees generated using different algorithmic methods have to be assessed with statistics rather than by comparison to the "true" tree. But what if you could build a synthetic tree of imaginary organisms, with known evolutionary relationships between each branch, to test your algorithms against? Meet the Caminalcules:

In order to assess and teach different methods of building phylogenetic trees, taxonomist Joseph Camin designed a set of adorable imaginary animals in the early 1960s. The animals, playfully referred to as "Caminalcules" by his graduate students, had a pre-defined evolutionary history that was reflected in the shapes and patterns of Caminalcule phenotypes. The set of 77 Caminalcules includes 29 living species and 48 "fossil" species, allowing a full reconstruction of the evolutionary tree. Students could test out their newly acquired skills of classification on this synthetic data set, comparing their results against the "true" evolutionary history of the answer key. Beyond its utility as a teaching tool, the set of Caminalcules also allowed for the development and testing of new kinds of classification schemes, particularly new numerical methods and algorithms.
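To make "comparing against the answer key" concrete, here is a minimal sketch of one way to score a proposed tree against a known one. It counts the clades (groups of species sharing an internal node) that appear in one tree but not the other, in the spirit of the Robinson-Foulds distance. This is an illustration only, not the scoring that Camin or his students used; the species labels A to D and the function names are hypothetical.

```python
def clades(tree):
    """Return (leaves under this node, set of all clades in the subtree).

    A tree is either a species name (str) or a tuple of child trees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # this internal node defines one clade
    return leaves, found


def clade_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    _, clades_a = clades(tree_a)
    _, clades_b = clades(tree_b)
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    # Hypothetical "answer key" and two student attempts.
    true_tree = (("A", "B"), ("C", "D"))
    same_shape = (("D", "C"), ("B", "A"))   # same groupings, different order
    different = ((("A", "C"), "B"), "D")    # groups A with C instead of B
    print(clade_distance(true_tree, same_shape))
    print(clade_distance(true_tree, different))
```

A score of zero means the proposed tree has exactly the same groupings as the key, regardless of the order in which branches are drawn; larger scores mean more groupings disagree.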

In a 1966 article in Scientific American, entomologist Robert Sokal discusses his work on computational systems that can sort and classify organisms and the ways that he used the Caminalcules to help develop new numerical methods. To Sokal, traditional methods of taxonomy were comparatively more "subjective," requiring the classifier to identify phenotypic characteristics and organize evolutionary trees by hand and "making taxonomy more of an art than a science." The emergence of the computer during the 1960s provided "many possibilities for objective and explicit classification."

Today the "digital" data held in gene sequences can be compared using algorithmic methods of alignment and clustering, but in 1966 there weren't any gene sequences available. Instead, Sokal used numerical and automated methods to compare the "analog" physical characteristics of the organisms using digital programs. One method of automated image processing that Sokal developed to convert variable phenotypic information into numerical data was to simply cover the Caminalcule line drawings with punchcards that had random holes punched out. Each hole would then be assigned a "1" or a "0" depending on whether there was a line drawn under that hole. Comparisons of these low-resolution digitizations of the different Caminalcules were able to generate trees similar to the original phylogeny.
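The punchcard idea can be sketched in a few lines of Python. The sketch below rests on several assumptions that are not in Sokal's article: the drawings are tiny made-up text glyphs rather than real Caminalcules, the number of holes is arbitrary, the dissimilarity is the fraction of holes that disagree, and the tree is built with average-linkage clustering, which is swapped in here as one common choice since the exact clustering procedure is not described above. All function and variable names are hypothetical.

```python
import random
from itertools import combinations


def make_mask(height, width, n_holes, seed=0):
    """Choose n_holes distinct (row, col) cells: the 'punched' holes."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(cells, n_holes)


def digitize(drawing, mask):
    """Read a drawing through the mask: 1 if ink shows under a hole, else 0."""
    return [1 if drawing[r][c] != " " else 0 for r, c in mask]


def mismatch_fraction(a, b):
    """Fraction of holes at which two digitized drawings disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage_tree(names, vectors):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    vec = dict(zip(names, vectors))
    clusters = [(name, [name]) for name in names]  # (subtree, member names)

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        total = sum(mismatch_fraction(vec[a], vec[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    # Made-up 4x6 "line drawings"; any non-space character counts as ink.
    toy_drawings = {
        "round_a": ["  ##  ", " #  # ", " #  # ", "  ##  "],
        "round_b": ["  ##  ", " #  # ", " # ## ", "  ##  "],
        "spiky":   ["#    #", " #  # ", "  ##  ", "#    #"],
    }
    mask = make_mask(height=4, width=6, n_holes=12)
    names = list(toy_drawings)
    vectors = [digitize(toy_drawings[n], mask) for n in names]
    print(average_linkage_tree(names, vectors))
```

Because every drawing is read through the same set of holes, the resulting strings of 1s and 0s are directly comparable, which is what lets a crude sample of each image stand in for a list of hand-picked characteristics.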

These punchcard images are a fascinating artifact of early computational biology, anticipating a very different future than what we have today, a future based not on gene sequence but on the automation of phenotypic characterization. Indeed, in his Scientific American article Sokal writes:

Most prominent among the devices likely to be useful in taxonomy are optical scanners, which digitize drawings, photographs, microscope preparations and results of biochemical analysis. The veritable flood of information that will flow from these automatic sensors will require computer-based processing and classification, since the human mind is not able to digest these data by traditional means.

Today when we talk about the flood of digital data we're usually referring to petabytes of genome data coming from sequencing centers and overwhelming our computational capacity to analyze and interpret that information. For Sokal, however, it was "by no means certain whether genes or their effects should form the basis of a classification," and even today, taxonomists look at a lot more than just gene sequence to classify organisms. Different kinds of phenotypic data might seem like another drop in the already flooded bucket, but biology is more than DNA, and the history of classification shows us that we need much more than sequences to organize and understand life.