May 6, 2012

Field Tests for Revised Psychiatric Guide Reveal Reliability Problems for 2 Major Diagnoses

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American.

PHILADELPHIA—In the summer of 2011 I began working on a feature article about a book that most people have never heard of—the Diagnostic and Statistical Manual of Mental Disorders (DSM), a reference guide for psychiatrists and clinicians. Most of the DSM's pages contain lists of symptoms that characterize different mental disorders (e.g. schizophrenia: delusions, hallucinations, disorganized speech and so on). The DSM not only defines mental illness, it often determines whether patients receive treatment—in many cases, insurance companies require an official DSM diagnosis before they subsidize medication or other therapies.

For the first time in 30 years the American Psychiatric Association (APA) is substantially revising the DSM to make diagnoses more accurate and make the book more user-friendly (1994's DSM-IV did not differ dramatically from 1980's DSM-III). The association plans to publish a brand new edition of the manual, the DSM-5, in May 2013.

When I was reporting my feature article, published in the May/June issue of Scientific American MIND, I spent a lot of time on the phone with members of the APA Task Force—the group of psychiatrists and researchers who oversee the revisions to the DSM. This weekend I attended the APA's annual meeting here in Philadelphia to hear some of these researchers speak in person and to learn more about the DSM-5. I was particularly excited about results from the "field trials"—dry runs of the new DSM-5 diagnoses at universities and clinics around the country. The field trials are primarily concerned with one question: do different psychiatrists using the revised DSM-5 diagnoses reach the same conclusion about the same patient? If they do, the updated lists of symptoms have high "reliability"—a good thing in medicine. If not, the new diagnoses are unreliable and the revisions are a failure.

The APA has not yet published the results of the field trials, but at the annual meeting in Philly the association gave a preview of the findings during a Saturday symposium. It was a first glimpse at extremely important data that many people have been waiting a long time to see.

Some of the results—and the way in which the speakers presented them—frustrated and concerned me.

To understand why, it's helpful to first discuss some statistics. I'll keep it simple. The APA uses a statistic called kappa to measure the reliability of different diagnoses. The higher the value of kappa, the more reliable the diagnosis, with 1.0 representing perfect reliability. The APA considers a diagnosis with a kappa of 0.8 or higher miraculously reliable; 0.6 to 0.8 is excellent; 0.4 to 0.6 is good; 0.2 to 0.4 "could be accepted" and anything below 0.2 is unacceptably unreliable. Low reliability is a big problem for clinicians, patients and researchers alike: it means that only a minority of clinicians agree when diagnosing a disorder and that researchers who want to study a particular disorder will have a very hard time identifying participants who truly have the disorder in question. If no one agrees, it is hard to make progress of any kind.
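For readers who want to see the arithmetic, here is a minimal sketch of the textbook two-rater form of the statistic (Cohen's kappa) in Python. The APA has not published its exact computation, so this is an illustration and not the field-trial method. The patient ratings are invented, and the boundary handling in the hypothetical `apa_label` helper is an assumption, since the ranges quoted above overlap at their edges.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of patients")
    n = len(rater_a)
    # Fraction of patients on whom the two raters actually agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters always use one label")
    return (observed - expected) / (1 - expected)


def apa_label(kappa):
    """Map kappa to the APA's descriptions as quoted in this article.

    Assumption: each range includes its lower bound.
    """
    if kappa >= 0.8:
        return "miraculously reliable"
    if kappa >= 0.6:
        return "excellent"
    if kappa >= 0.4:
        return "good"
    if kappa >= 0.2:
        return "could be accepted"
    return "unacceptably unreliable"


# Invented example: 1 = diagnosis given, 0 = diagnosis not given, ten patients.
rater_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2), apa_label(kappa))  # about 0.21, "could be accepted"
```

In this made-up case the two raters agree on 7 of 10 patients, yet kappa comes out near 0.21, because kappa discounts the agreement that chance alone would produce.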

Darrel Regier, vice chair of the APA's DSM-5 Task Force, presented kappas for various DSM-5 diagnoses—the first publicly released results from the field trials. Fortunately, the kappas for many of the DSM-5 diagnoses look strong. Field trials of the new autism spectrum disorder (ASD), for example—which collapses DSM-IV diagnoses for autistic disorder, Asperger's and other developmental conditions into one category—yielded a kappa of 0.69. However, two pitiful kappas shocked me. The kappa for generalized anxiety disorder was about 0.2 and the kappa for major depressive disorder was about 0.3.

These numbers are way too low according to the APA's own scales—and they are much lower than kappas for the disorders in previous versions of the DSM. Regier and other members of the APA emphasized that field trial methodology for the latest edition is far more rigorous than in the past and that kappas for many diagnoses in earlier editions of the DSM were likely inflated. But that doesn't change the fact that the APA has a problem on its hands: its own data suggests that some of the updated definitions are so flawed that only a minority of psychiatrists reach the same conclusions when using them on the same patient. And the APA has limited time to do something about it.

Although the APA has been working on the DSM-5 for more than 11 years now, field trials only started within the last year. While reporting my feature, I asked members of the APA why they waited so long to conduct the field trials. After all, only one year remains until scheduled publication of the DSM-5 and we still do not know whether the revised diagnoses are reliable and whether they are a genuine improvement over their predecessors. I never received a satisfactory answer.

To make an analogy, consider a baker who spends months developing a recipe for the ultimate chocolate cake in his head and—a day before he has to deliver the cake—finally tries out the recipe only to discover that the cake tastes awful. He has one day to come up with something else. The APA has placed itself in a similarly desperate position. The final drafts of the new manual are due December of this year, which means the APA has less than 8 months to implement what it has learned from the field trials if it wants to publish on schedule. New field trials would take years to arrange and at least one additional year to conduct. Either the association delays publication of the DSM-5 for several more years, revises the diagnoses yet again and conducts new field trials—or it goes forward with the current schedule and publishes a significantly flawed DSM-5.

If the APA has a plan of action—beyond vague statements like "continuing to analyze our data"—the association did not make it clear at the symposium. The presenters hardly seemed troubled by the alarming results. Even worse, they sometimes came off as oblivious.

Eve Moscicki of the American Psychiatric Institute for Research and Education gave the final presentation in the symposium. Moscicki helped coordinate the field trials in clinics. For some reason, Moscicki decided to spend more than half her allotted time on irrelevant details—such as the benefits of a good technical support team—before getting to the actual field trial results. Finally she pulled up some colorful bar graphs showing what clinicians and patients thought about the new DSM-5 diagnoses. The bars showed what percentage of respondents thought that the new definitions were Extremely Useful, Very Useful, Moderately Useful, Slightly Useful or Not at All Useful. Infographic enthusiasts know that bar graphs are a weak way to present data like this—it's difficult to make visual comparisons across so many categories at the same time. A pie chart would have been much clearer. **(See Edited to Add below for corrections and clarification).**

"Well, yes, it looks to me like the majority thought it was very or extremely useful," Moscicki said of one of the revised diagnoses.

"That's incorrect," I said, standing up. "37 percent plus 7 percent does not equal more than 50 percent." In fact, the majority of respondents thought that the new criteria were somewhere between moderately and not at all useful. "You can't present this data as a bar graph. It's deceptive," I added. It was the third time that Moscicki had made such a mistake, overestimating the percentage of positive responses and glossing over the DSM-5's shortcomings apparent in the results.

"Well, umm, just remember this is a first look…"

"Totally deceptive," I said. I swung my backpack over one shoulder and walked out of the room.

In retrospect, I should not have called the graph deceptive, although I do still think that the data was poorly presented. I wish I had stuck around for the final minutes of the presentation, but I was too upset to remain in the room any longer. Perhaps I overreacted. After reflecting on the experience, however, I remain genuinely concerned about the future of the DSM.

Moscicki is right about one thing: this is just a first look. Until the APA officially publishes the results of the field trials, nobody outside the association can complete a proper analysis. What I have seen so far has convinced me that the association should anticipate even stronger criticism than it has already weathered. In fairness, the APA has made changes to the drafts of the DSM-5 based on earlier critiques. But the drafts are only open to comment for another six weeks. And so far no one outside the APA has had access to the field trial data, which I have no doubt many researchers will seize and scour. I only hope that the flaws they uncover will make the APA look again—and look closer.

**Edited to Add**

A few people have pointed out that a pie chart is not necessarily clearer than a bar graph when it comes to presenting the data I discussed. That's true. I realize now I did not explain my meaning correctly. What bothered me is that Moscicki was guesstimating. She was eyeballing the percentages represented by different bars and adding them together in her head to see if, combined, the Very and Extremely useful percentages were greater than the rest of the categories. Instead, she should have graphically combined the data into two categories for clear comparison—whether as two wedges in a pie chart or as two bars—before her presentation. The solution that popped into my mind at the time was a pie chart in which the wedge representing the combined Very and Extremely useful percentages was clearly less than half of the pie and the wedge representing the combined Moderately, Slightly and Not at All useful categories was clearly more than half. In the grand scheme of things, this particular point is a quibble—but it was the straw that broke the camel's back. My frustration had been building throughout the symposium and I could not stand for what I perceived as glib treatment of crucial data.
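The regrouping described above takes only a few lines of Python. This is a sketch under stated assumptions: only the two figures quoted at the symposium (37 and 7 percent) come from the presentation, the text does not say which figure belongs to which category, and the remainder is assumed to be 100 minus their sum. The function and key names are hypothetical.

```python
def collapse_to_two_groups(favorable_percentages, total=100.0):
    """Collapse survey categories into a favorable group and everything else.

    Assumption: the percentages across all categories add up to `total`.
    """
    favorable = sum(favorable_percentages)
    if not 0 <= favorable <= total:
        raise ValueError("favorable percentages must sum to between 0 and total")
    return {
        "very_or_extremely_useful": favorable,
        "moderately_or_less_useful": total - favorable,
    }


# The two figures quoted in the article for the Very and Extremely Useful bars.
groups = collapse_to_two_groups([37, 7])
majority_favorable = groups["very_or_extremely_useful"] > 50

print(groups, majority_favorable)
```

With the numbers precomputed this way, the comparison no longer depends on anyone eyeballing bar heights during a talk: the favorable group is 44 percent and the rest is 56 percent.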