Over time, patients provide a wealth of information to their health care providers, and when those data are aggregated, they are also a boon to researchers studying trends in diseases and demographics for clues to better ways of treating illness. As more patient health care records go digital, patient information is shared more widely among researchers, which can be a good thing or a bad thing, depending on who has access to it.

Electronic medical record (EMR) systems contain detailed yet anonymous patient-level data represented in codes that correspond to different health conditions, including diseases, symptoms and injuries. EMRs are increasingly being used to provide data for genome-wide association studies (GWAS), which identify relationships between specific genomic variants and health-related phenomena, a key to delivering on the promise of personalized medicine. However, patient privacy can be threatened when personal information is linked to genetic information using codes that are available through public databases and electronic medical records, a team of Vanderbilt University researchers in Nashville concludes in a study published Monday in the Proceedings of the National Academy of Sciences.

The researchers say they demonstrated this problem in their study, in which they identified 96 percent of a group of 2,762 patients with the help of the diagnosis codes in the patients' records.
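To make the risk concrete, the sketch below counts how many patients in a small data set have a combination of diagnosis codes that no other patient shares, since such a combination can act like a fingerprint. It is only an illustration under stated assumptions: the function name `unique_fraction`, the patient IDs and the sample codes are all hypothetical, and the study's own measurement procedure is not described in this article.

```python
from collections import Counter


def unique_fraction(records):
    """Return the share of patients whose set of diagnosis codes is unique.

    `records` maps a patient ID to a list of diagnosis codes (hypothetical layout).
    """
    if not records:
        return 0.0
    # Treat each patient's codes as an unordered set, the "signature" of the record.
    signatures = {pid: frozenset(codes) for pid, codes in records.items()}
    counts = Counter(signatures.values())
    # A patient is singled out when no other record has the same signature.
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)


# Made-up sample data, not taken from the study.
records = {
    "p1": ["250.00", "401.9"],
    "p2": ["250.00", "401.9"],
    "p3": ["493.90"],
    "p4": ["250.00", "272.4"],
}

print(unique_fraction(records))  # 0.5: p3 and p4 are unique, p1 and p2 are not
```

In this toy data, half of the patients could be singled out by their codes alone; the figure the researchers report for their real data set is far higher.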

A possible solution, according to Vanderbilt researchers Grigorios Loukides, Aris Gkoulalas-Divanis and Bradley Malin, is to use a method for creating anonymous records that replaces the current system, known as the International Statistical Classification of Diseases and Related Health Problems (ICD), with a series of related codes. The researchers created an algorithm that generalizes clinical information so that patients remain anonymous while preserving the medical and genetic connections that researchers need.
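The general idea of replacing specific codes with broader ones can be sketched as follows. This is not the researchers' algorithm, whose details are not given here. As a simple stand-in, the sketch truncates each code to the part before the decimal point and then checks the size of the smallest group of patients who share the same generalized record. The names `generalize`, `generalize_record` and `smallest_group`, and the sample data, are hypothetical.

```python
from collections import Counter


def generalize(code):
    """Stand-in generalization: keep only the category part before the dot."""
    return code.split(".")[0]


def generalize_record(codes):
    """Generalize every code in a record and return the result as a set."""
    return frozenset(generalize(c) for c in codes)


def smallest_group(records, transform):
    """Size of the smallest group of patients sharing the same transformed record."""
    counts = Counter(transform(codes) for codes in records.values())
    return min(counts.values())


# Made-up sample data, not taken from the study.
records = {
    "p1": ["250.00", "401.9"],
    "p2": ["250.02", "401.1"],
    "p3": ["493.90"],
    "p4": ["493.20"],
}

print(smallest_group(records, frozenset))          # 1: every raw record is unique
print(smallest_group(records, generalize_record))  # 2: each patient now has a twin
```

The trade-off is visible even in this toy case: the generalized records no longer single anyone out, but they also carry less clinical detail, which is why a practical method has to choose its groupings so that the associations researchers care about survive.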

Loukides and his colleagues tested how well the algorithm protected data against simulated attacks by a malicious hacker, using actual information from more than 2,600 patients. They assumed a potential hacker knew a patient's identity, some or all of a patient's ICD codes, and whether the patient record was included in released data. The technique foiled attempts to uncover a patient's private information, the researchers wrote, and maintained the data integrity necessary to retain useful information for validating genome-wide studies.
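A simulated attack of this kind can be pictured as a lookup: the attacker takes the codes known for one person and searches the released data for records that contain all of them. The sketch below is a hypothetical illustration of that test, not the researchers' evaluation code; `matching_records`, the record IDs and both sample releases are assumptions made for the example.

```python
def matching_records(released, known_codes):
    """Return IDs of released records that contain every code the attacker knows."""
    known = set(known_codes)
    return [rid for rid, codes in released.items() if known <= set(codes)]


# Made-up release with specific codes (hypothetical data).
raw_release = {
    "r1": ["250.00", "401.9"],
    "r2": ["250.02", "401.1"],
    "r3": ["493.90"],
}

# The same records after a hypothetical generalization to broader codes.
generalized_release = {
    "r1": ["250", "401"],
    "r2": ["250", "401"],
    "r3": ["493"],
}

# With specific codes, knowing one code is enough to pin down a single record.
print(matching_records(raw_release, ["250.00"]))       # ['r1']

# With generalized codes, the same knowledge matches more than one record.
print(matching_records(generalized_release, ["250"]))  # ['r1', 'r2']
```

An attack is foiled in this sense when the search never narrows to a single record, so the attacker cannot tell which released record, and therefore which genetic data, belongs to the target.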
