October 1, 2013

Dealing with the unexpected: William Kahan and the need for graceful exits


By Markus Pössel

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American

In a previous post, I took a trip down read-only memory lane with William Kahan. On the same trip, Kahan told me about his first commercial programming job. The software was meant to aid the operations of Trans-Canada Air Lines (now Air Canada) by helping it save on, wait for it: telegraph charges. To this end, it was to keep track of the seats available on various flights, the better to coordinate reservations (presumably cutting down on unnecessary inquiries by telegraph back and forth).

Writing the part of the software that took care of the seating arrangements was the easy part. The hard part was that the system would run error-free for 5 minutes at most - and that Kahan had to prepare for a demonstration that, unsurprisingly, was slated to last significantly longer than 5 minutes. That is why most of the program ended up consisting of safeguards, checksums and recovery procedures. With those in place, the program would work error-free for a much longer time.

The safeguards were a success. As Kahan remembers, only one other person, Harvey Gellman, managed to get useful work from that computer when it had one of its cranky days.

With this history, it's not a surprise that Kahan has always had an eye out for all that can go wrong with computer programs. And he has come to the conclusion that, at this time, the computing world in general is not handling the problem very well.

What, after all, happens when a program, or part of a program, encounters an unforeseen condition? Some sensor reading not anticipated by the programmer? An unexpected floating-point value? (I have written about some of Kahan's examples for truly weird floating-point behavior in Built-in errors: William Kahan and floating-point arithmetic.)

Unless someone has foreseen that particular kind of problem, and taken pains to prepare for it - as Kahan did in his airline reservation project - the convention is for the program to be aborted and, in effect, abandoned.

In the 1960s, with programs running in batches one after the other, that might have been acceptable. After all, if one program crashed, the program that was next in line would start up and do its stuff. Today, with programs controlling complex operations, taking input from sensors and running equipment, and with the constant need for something to stay in control, these same conventions can be downright destructive.

After all, they may mean that at some unspecified point in the program, control will jump to some unspecified place. And there is no guarantee that the system will know how to deal with this sudden, haphazard change of control.

This, Kahan stresses, is not some academic exercise. For him, the most striking example is Air France Flight 447, which crashed on June 1, 2009. When the plane's air-speed sensors froze over and, in consequence, gave inconsistent readings, the autopilot did exactly what Kahan criticizes: It aborted, handing control over to, in this case, the human pilots in a way that did not tell them anything about what was going on. The pilots did not have sufficient time to make inquiries, since the plane stalled and, minutes later, hit the surface of the Atlantic Ocean, disintegrating on impact and killing all 228 men and women aboard.

Another example is the USS Yorktown. Quoting from this text (PDF) by Kahan:

On 21 Sept. 1997, the Yorktown was maneuvering off the coast of Cape Charles, VA, when a crewman accidentally ENTERed a blank field into a data base. The blank was treated as a zero and caused a Divide-by-Zero Exception which the data-base program could not handle. It aborted to the operating system, Microsoft Windows NT 4.0, which crashed, bringing down all the ship’s LAN consoles and miniature remote terminals.

The result: A fairly large combat vessel paralyzed for almost 3 hours. Luckily, neither in combat nor in dangerous waters.
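To make the failure mode in Kahan's account concrete, here is a minimal Python sketch. It is not the Yorktown's actual software: the function names parse_field and ratio and the sample values are invented for illustration. It only shows the difference between silently treating a blank as zero and telling the caller that no value was entered.

```python
from typing import Optional


def parse_field(text: str) -> Optional[float]:
    """Return the number typed into a data-entry field, or None if it is blank.

    Hypothetical helper: non-blank text that is not a number still raises
    ValueError in this sketch.
    """
    stripped = text.strip()
    if not stripped:
        return None  # a blank means "no value entered", not zero
    return float(stripped)


def ratio(total: float, field_text: str) -> Optional[float]:
    """Divide total by the entered value; return None instead of crashing."""
    divisor = parse_field(field_text)
    if divisor is None or divisor == 0.0:
        return None  # the caller has to decide what a missing result means
    return total / divisor


print(ratio(10.0, "4"))    # 2.5
print(ratio(10.0, "   "))  # None
```

The point of the sketch is that the blank never reaches the division, and that the calling code gets an explicit "no result" it has to deal with.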

Then, there is the failure of the Ariane 5 flight on June 4, 1996. Part of the navigational system encountered conditions it could not handle. That part of the system (tasked with keeping track of the rocket's changes in speed and orientation) then shut down completely, and the rocket veered off course and was destroyed.
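According to the published inquiry report on that flight, the condition in question was a 64-bit floating-point value that was too large to be converted to a 16-bit signed integer. The following Python sketch is not the Ariane code (which was written in Ada); the function name to_int16_checked and the choice to clamp to the nearest representable value are my assumptions. It merely illustrates one way a conversion can report a problem instead of shutting everything down.

```python
import math
from typing import Tuple

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> Tuple[int, bool]:
    """Convert a float to the 16-bit signed range.

    Returns (result, in_range). Out-of-range inputs are clamped to the
    nearest limit and flagged, so the caller can react to the flag.
    """
    if math.isnan(value):
        return 0, False  # no meaningful integer exists; flag it
    if value > INT16_MAX:
        return INT16_MAX, False
    if value < INT16_MIN:
        return INT16_MIN, False
    return int(value), True  # int() truncates toward zero


print(to_int16_checked(123.9))  # (123, True)
print(to_int16_checked(1.0e6))  # (32767, False)
```

Whether clamping is the right reaction depends entirely on the application; what matters for Kahan's argument is that the caller is told, in a predictable way, that something went wrong.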

So can't people just anticipate what might go wrong, and write code that takes care of things? That's easier said than done. Kahan points to research by Westley Weimer, who (here with George Necula) analyzed millions of lines of (Java) code and found hundreds of mistakes in handling errors - mostly error-handling procedures that failed to "clean up" behind them properly.
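The kind of mistake described here, an error path that skips the cleanup, can be shown in a few lines. The sketch below uses Python rather than the Java of the study, and the Connection class and the function names are made up for the purpose. It is not taken from Weimer and Necula's work.

```python
class Connection:
    """Hypothetical resource that should always be closed after use."""

    def __init__(self) -> None:
        self.is_open = True

    def query(self, fail: bool) -> str:
        if fail:
            raise RuntimeError("query failed")
        return "ok"

    def close(self) -> None:
        self.is_open = False


def leaky(conn: Connection, fail: bool) -> str:
    # If query() raises, close() is never reached and the connection leaks.
    result = conn.query(fail)
    conn.close()
    return result


def tidy(conn: Connection, fail: bool) -> str:
    try:
        return conn.query(fail)
    finally:
        conn.close()  # runs on the normal path and on the error path


def still_open_after_failure(handler) -> bool:
    """Run handler with a failing query and report whether it leaked."""
    conn = Connection()
    try:
        handler(conn, True)
    except RuntimeError:
        pass
    return conn.is_open


print(still_open_after_failure(leaky))  # True
print(still_open_after_failure(tidy))   # False
```

Both versions behave identically as long as nothing goes wrong, which is presumably why such mistakes are easy to overlook.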

Kahan sees this as a fundamental flaw in (most) current programming languages. He argues that languages should be designed to force programmers to implement proper procedures for handling unexpected situations. If a part of a program crashes, it should hand over control to whatever other program called it, reporting on what went wrong. And the part of the program that receives the report should be required to contain some provisions for handling the situation.

In compiled languages (the compiler being the software that translates the lines of written code into a file one can run on the computer, catching some programming errors in the process), Kahan argues, the programmer should be required to state what happens whenever one part of the program is handed control because another part has failed. If the programmer can come up with some better way of anticipating the unexpected, a better default, then Kahan is willing to be corrected - but, he argues, there has to be a default, some predictable behavior by the program if the programmer has not specified anything else.
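As a rough sketch of the pattern Kahan describes (a failing part reports to its caller, and a predictable default applies if the caller has specified nothing), consider the following Python example. Everything in it is my own assumption: the names FailureReport and run_component, and the choice of "no value" as the default. Python also cannot enforce any of this at compile time, which is precisely the part Kahan wants languages to take care of.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class FailureReport:
    """What the failing part tells its caller (hypothetical structure)."""
    component: str
    reason: str


def run_component(
    name: str,
    task: Callable[[], float],
    on_failure: Optional[Callable[[FailureReport], float]] = None,
) -> Tuple[Optional[float], Optional[FailureReport]]:
    """Run task and hand control back to the caller in every case.

    On failure the caller receives a report. If the caller supplied no
    handler, the default is predictable: the value is None ("no result").
    """
    try:
        return task(), None
    except Exception as exc:
        report = FailureReport(component=name, reason=repr(exc))
        if on_failure is not None:
            return on_failure(report), report
        return None, report


# No handler given: the default applies and the report is still delivered.
value, report = run_component("airspeed", lambda: 1.0 / 0.0)
print(value, report.component)  # None airspeed

# Handler given: the caller decides what to use in place of the lost value.
value, report = run_component("airspeed", lambda: 1.0 / 0.0,
                              on_failure=lambda r: 250.0)
print(value)  # 250.0
```

In both cases control returns to a known place with a description of what happened, rather than jumping somewhere unspecified.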

It's difficult to effect such fundamental change. Retrofitting all older software is out of the question (even something much simpler, the Y2K problem, took $300 billion to fix). The best you can hope for is to get future versions of existing programming languages to implement safer hand-offs of control. Even then, you are likely to run up against the different philosophies of those designing the language (a major reason for the variety of programming languages in the first place). When it comes to the question of what to leave up to the programmers, and what to enforce via compiler, different language designers are likely to come to different conclusions.

Kahan quotes one whom he thinks typical: "You're talking about rescuing programmers from their own mistakes when they don't use best possible practice."

Perhaps (I thought later on) standards for software controlling critical systems such as aircraft or missile cruisers should just be defined by law - as, presumably, are the standards that govern the basics of building construction and other life-or-death-critical activities.

At least Vint Cerf, this time not with his Google VP or Internet pioneer hats on, but as president of the Association for Computing Machinery, promised to follow up on the issue right after Kahan's talk at the HLF on Thursday. We'll see how it goes. For now, Kahan remains something of a prophet in the desert.

.....

This blog post originates from the official blog of the 1st Heidelberg Laureate Forum (HLF) which takes place September 22 - 27, 2013 in Heidelberg, Germany. 40 Abel, Fields, and Turing Laureates will gather to meet a select group of 200 young researchers. Markus Pössel is a member of the HLF blog team. Please find all his postings on the HLF blog.