Observations: Opinion, arguments & analyses from the editors of Scientific American

Field Tests for Revised Psychiatric Guide Reveal Reliability Problems for 2 Major Diagnoses

The views expressed are those of the author and are not necessarily those of Scientific American.



(Credit: Ferris Jabr)

PHILADELPHIA—In the summer of 2011 I began working on a feature article about a book that most people have never heard of—the Diagnostic and Statistical Manual of Mental Disorders (DSM), a reference guide for psychiatrists and clinicians. Most of the DSM’s pages contain lists of symptoms that characterize different mental disorders (e.g., schizophrenia: delusions, hallucinations, disorganized speech and so on). The DSM not only defines mental illness, it often determines whether patients receive treatment—in many cases, insurance companies require an official DSM diagnosis before they subsidize medication or other therapies.

For the first time in 30 years the American Psychiatric Association (APA) is substantially revising the DSM to make diagnoses more accurate and make the book more user-friendly (1994’s DSM-IV did not differ dramatically from 1980’s DSM-III). The association plans to publish a brand new edition of the manual, the DSM-5, in May 2013.

When I was reporting my feature article, published in the May/June issue of Scientific American MIND, I spent a lot of time on the phone with members of the APA Task Force—the group of psychiatrists and researchers who oversee the revisions to the DSM. This weekend I attended the APA’s annual meeting here in Philadelphia to hear some of these researchers speak in person and to learn more about the DSM-5. I was particularly excited about results from the “field trials”—dry runs of the new DSM-5 diagnoses at universities and clinics around the country. The field trials are primarily concerned with one question: do different psychiatrists using the revised DSM-5 diagnoses reach the same conclusion about the same patient? If they do, the updated lists of symptoms have high “reliability”—a good thing in medicine. If not, the new diagnoses are unreliable and the revisions are a failure.

The APA has not yet published the results of the field trials, but at the annual meeting in Philly the association gave a preview of the findings during a Saturday symposium. It was a first glimpse at extremely important data that many people have been waiting a long time to see.

Some of the results—and the way in which the speakers presented them—frustrated and concerned me.

To understand why, it’s helpful to first discuss some statistics. I’ll keep it simple. The APA uses a statistic called kappa to measure the reliability of different diagnoses. The higher the value of kappa, the more reliable the diagnosis, with 1.0 representing perfect reliability. The APA considers a diagnosis with a kappa of 0.8 or higher miraculously reliable; 0.6 to 0.8 is excellent; 0.4 to 0.6 is good; 0.2 to 0.4 “could be accepted” and anything below 0.2 is unacceptably unreliable. Low reliability is a big problem for clinicians, patients and researchers alike: it means that only a minority of clinicians agree when diagnosing a disorder and that researchers who want to study a particular disorder will have a very hard time identifying participants who truly have the disorder in question. If no one agrees, it is hard to make progress of any kind.
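For readers who want to see the statistic itself: in its simplest form, for two raters, it is known as Cohen’s kappa, and it compares the agreement the raters actually achieved with the agreement they would reach by chance, given how often each one assigns each label. Here is a minimal sketch in Python with made-up diagnoses (the DSM-5 field trials used a more elaborate test-retest design, but the core idea is the same):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of patients given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled patients independently
    # at their own marginal rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: two clinicians each decide whether ten patients
# meet the criteria for major depressive disorder (MDD).
a = ["MDD", "MDD", "no", "no", "MDD", "no", "MDD", "no", "no", "MDD"]
b = ["MDD", "no",  "no", "no", "MDD", "MDD", "no", "no", "no", "MDD"]
print(round(cohens_kappa(a, b), 2))  # → 0.4
```

Here the two clinicians agree on seven of ten patients, which sounds respectable—but because half that agreement is expected by chance, the kappa is only 0.4, at the bottom of the APA’s “good” range.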

Darrel Regier, vice chair of the APA’s DSM-5 Task Force, presented kappas for various DSM-5 diagnoses—the first publicly released results from the field trials. Fortunately, the kappas for many of the DSM-5 diagnoses look strong. Field trials of the new autism spectrum disorder (ASD), for example—which collapses DSM-IV diagnoses for autistic disorder, Asperger’s and other developmental conditions into one category—yielded a kappa of 0.69. However, two pitiful kappas shocked me. The kappa for generalized anxiety disorder was about 0.2 and the kappa for major depressive disorder was about 0.3.

These numbers are way too low according to the APA’s own scales—and they are much lower than kappas for the disorders in previous versions of the DSM. Regier and other members of the APA emphasized that field trial methodology for the latest edition is far more rigorous than in the past and that kappas for many diagnoses in earlier editions of the DSM were likely inflated. But that doesn’t change the fact that the APA has a problem on its hands: its own data suggests that some of the updated definitions are so flawed that only a minority of psychiatrists reach the same conclusions when using them on the same patient. And the APA has limited time to do something about it.

Although the APA has been working on the DSM-5 for more than 11 years now, field trials only started within the last year. While reporting my feature, I asked members of the APA why they waited so long to conduct the field trials. After all, only one year remains until scheduled publication of the DSM-5 and we still do not know whether the revised diagnoses are reliable and whether they are a genuine improvement over their predecessors. I never received a satisfactory answer.

To make an analogy, consider a baker who spends months developing a recipe for the ultimate chocolate cake in his head and—a day before he has to deliver the cake—finally tries out the recipe, only to discover that the cake tastes awful. He has one day to come up with something else. The APA has placed itself in a similarly desperate position. The final drafts of the new manual are due in December of this year, which means the APA has less than eight months to implement what it has learned from the field trials if it wants to publish on schedule. New field trials would take years to arrange and at least one additional year to conduct. Either the association delays publication of the DSM-5 for several more years, revises the diagnoses yet again and conducts new field trials—or it goes forward with the current schedule and publishes a significantly flawed DSM-5.

If the APA has a plan of action—beyond vague statements like “continuing to analyze our data”—the association did not make it clear at the symposium. The presenters hardly seemed troubled by the alarming results. Even worse, they sometimes came off as oblivious.

Eve Moscicki of the American Psychiatric Institute for Research and Education gave the final presentation in the symposium. Moscicki helped coordinate the field trials in clinics. For some reason, Moscicki decided to spend more than half her allotted time on irrelevant details—such as the benefits of a good technical support team—before getting to the actual field trial results. Finally she pulled up some colorful bar graphs showing what clinicians and patients thought about the new DSM-5 diagnoses. The bars showed what percentage of respondents thought that the new definitions were Extremely Useful, Very Useful, Moderately Useful, Slightly Useful or Not at All Useful. Infographic enthusiasts know that bar graphs are a weak way to present data like this—it’s difficult to make visual comparisons across so many categories at the same time. A pie chart would have been much clearer. **(See Edited to Add below for corrections and clarification).**

“Well, yes, it looks to me like the majority thought it was very or extremely useful,” Moscicki said of one of the revised diagnoses.

“That’s incorrect,” I said, standing up. “37 percent plus 7 percent does not equal more than 50 percent.” In fact, the majority of respondents thought that the new criteria were somewhere between moderately and not at all useful. “You can’t present this data as a bar graph. It’s deceptive,” I added. It was the third time that Moscicki had made such a mistake, overestimating the percentage of positive responses and glossing over the DSM-5’s shortcomings that were apparent in the results.

“Well, umm, just remember this is a first look…”

“Totally deceptive,” I said. I swung my backpack over one shoulder and walked out of the room.

In retrospect, I should not have called the graph deceptive, although I do still think that the data was poorly presented. I wish I had stuck around for the final minutes of the presentation, but I was too upset to remain in the room any longer. Perhaps I overreacted. After reflecting on the experience, however, I remain genuinely concerned about the future of the DSM.

Moscicki is right about one thing: this is just a first look. Until the APA officially publishes the results of the field trials, nobody outside the association can complete a proper analysis. What I have seen so far has convinced me that the association should anticipate even stronger criticism than it has already weathered. In fairness, the APA has made changes to the drafts of the DSM-5 based on earlier critiques. But the drafts are only open to comment for another six weeks. And so far no one outside the APA has had access to the field trial data, which I have no doubt many researchers will seize and scour. I only hope that the flaws they uncover will make the APA look again—and look closer.

**Edited to Add**

A few people have pointed out that a pie chart is not necessarily clearer than a bar graph when it comes to presenting the data I discussed. That’s true. I realize now I did not explain my meaning correctly. What bothered me is that Moscicki was guesstimating. She was eyeballing the percentages represented by different bars and adding them together in her head to see if, combined, the Very and Extremely useful percentages were greater than the rest of the categories. Instead, she should have graphically combined the data into two categories for clear comparison—whether as two wedges in a pie chart or as two bars—before her presentation. The solution that popped into my mind at the time was a pie chart in which the wedge representing the combined Very and Extremely useful percentages was clearly less than half of the pie and the wedge representing the combined Moderately, Slightly and Not at All useful categories was clearly more than half. In the grand scheme of things, this particular point is a quibble—but it was the straw that broke the camel’s back. My frustration had been building throughout the symposium and I could not stand for what I perceived as glib treatment of crucial data.
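The arithmetic I wanted done ahead of time is trivial. Here is a sketch using the 37 and 7 percent figures from the exchange above; the split among the remaining three categories is hypothetical, since the actual field trial data has not been published:

```python
# Hypothetical response percentages for one revised diagnosis.
# The 37 and 7 are from the exchange above; the rest are invented
# placeholders that sum to 100.
responses = {
    "Extremely useful": 7,
    "Very useful": 37,
    "Moderately useful": 30,
    "Slightly useful": 16,
    "Not at all useful": 10,
}

# Collapse the five categories into the two groups actually being
# compared, rather than eyeballing sums across five separate bars.
favorable = responses["Extremely useful"] + responses["Very useful"]
unfavorable = 100 - favorable
print(favorable, unfavorable)  # → 44 56
```

With the data collapsed this way—two wedges or two bars—it is immediately visible that the favorable responses fall short of a majority, which is the comparison the audience needed to make.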

About the Author: Ferris Jabr is an associate editor focusing on neuroscience and psychology. Follow on Twitter @ferrisjabr.


Comments (12)
  1. scicurious 2:47 pm 05/6/2012

    This definitely sounds problematic. I was wondering, though, what were the kappas for GAD and MDD for the DSM-IV? I can only find III-R for GAD, which has it listed at 0.53. One source has MDD listed at 0.753, but I can’t find an original source. If these kappa values are so much higher, why are they revising these particular diagnoses?

  2. meschaeffer 3:54 pm 05/6/2012

    Thank you for pointing out the problems with Moscicki’s data analysis and with the field trials in general. These low values are incredibly concerning for disorders like GAD and MDD because they’re so prevalent in the population – I think sometimes people forget that the abstract notions in these manuals affect real people in terms of diagnostics, treatment, and insurance coverage.

  3. jchristi 4:49 pm 05/6/2012

    Sounds like the graphic was less than ideal. And that a pie chart would’ve shown parts of a whole more clearly (with some categories adding up to be visibly less than half). But I take issue with this statement: “Infographic enthusiasts know that bar graphs are a weak way to present data like this—it’s difficult to make visual comparisons across so many categories at the same time.”

    Generally bars are much easier to compare across categories than wedges. For a really rational and comprehensive discussion on the strengths and weaknesses of pie charts, check out

  4. rosabw 7:16 pm 05/6/2012

    I’m surprised they are so close for Aspergers, being as Bipolar, ADD, LD, Dyslexia, Autism, are so frequently intertwined. Or maybe they are all Aspergers, now. We are all Aspergers, now, come to think of it…

  5. ejwillingham 8:59 pm 05/6/2012

    I’m curious about what that bar graph suggested about how patients and clinicians feel about the ASD revisions because what I’m seeing in the community is closer to “not at all useful” or even “harmful” than anything else.

  6. Rock LeBateau 4:42 am 05/7/2012

    What has happened to evidence based medicine? I would have thought that if your physician cannot PROVE that you are suffering from a generalised anxiety disorder, or a major depressive order, then any treatment can be interpreted as common assault. (bring on the lawyers)

  7. notscientific 4:55 am 05/7/2012

    I am glad that in retrospect you realised that leaving the room was not the best option. While you appeared to have made your point during the presentation, staying until the end of the presentation and attending question-time would have allowed you to further communicate your point and may even have allowed for some interesting discussion with other members of the audience.

  8. suzychapman 8:40 am 05/7/2012

    @ scicurious

    I can’t provide you with original sources, but the following kappas for GAD and MDD (and a number of others) have been published, today, by Allen Frances MD, who chaired the Task Force for DSM-IV:

    GAD: .20, .65, .30, .72

    MDD: .32, .59, .53, .80

    Anyone got results for the SSDs, CSSD and SSSD?

  9. Surfernova 10:34 am 05/7/2012

    The irony here is palpable. The working definition of psychological health is “being in touch with REALITY”.
    This situation is, at its core, an issue in understanding leadership, especially in the Reality/Courage realm of good project management.

  10. julianpenrod 8:41 pm 05/7/2012

    In fact, “psychiatry”, “psychology”, “psychoanalysis” all have traits that utterly disqualify them not merely as “scientific”, but even as useful. Consider that all “psychoanalytic theories” are definitively different, if not wholly incompatible. Freud’s emphasis on sex, Fromm saying a desire for conformity and programming defines man, Skinner saying everything is already programmed, Jung invoking already defined stereotypes. And all supposedly describing the same human mind! And yet no devotees of “science” attack “psychology” as a fraud! They respect the money, not ethics! Just as in the case of the single most accusatory quality of “psychoanalysis”. Claiming to be a modeling of all human minds, “psychoanalytic theories” all derive solely from case histories! All “psychoanalytic theory” is based on individuals with proven mental instabilities, yet, they try to generalize from these to healthy minds! No “psychoanalytic theory” recognized by “science” derived from interviewing sane people!

  11. Jim Lacey 11:49 am 05/10/2012

    The treatment of mental disorders has surely improved over the last two centuries. Compassionate treatment, talk therapy, the use of drugs, and even shock, have enabled many who previously would have been institutionalized to lead fulfilling lives in society. That said, such treatment and diagnoses all too often seem hit-or-miss affairs. One patient’s problem has been (in my experience) diagnosed differently by three therapists–as schizophrenia, as bi-polar disorder, and as depression. All of them combined reassuring talk with whatever magic pill was popular at the moment. The patient had been suffering brief psychotic episodes about every seven years, but has been fine for the last twenty, ten of them drug free. It did not make any difference, it seemed, whether a psychiatrist, a psychologist, or a therapist (whatever that may be) was being consulted. Psychological therapy seems to lack consistent basic theory or even agreement about what sort of abnormality is being treated. It seems as questionable as medical diagnoses before the discovery of bacteria and viruses.

  12. stan e m 4:09 pm 05/14/2012

    Mental illness is probably caused by food poisons like BMAA and mercury. Seafood can’t be trusted because of pollution.
