Thursday, June 22, 2017 - 14:00 to 15:30
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification
When individual-level health data is shared in biomedical research, the privacy of patients and probands must be protected. This is typically achieved with methods of data de-identification, which transform data in such a way that formal guarantees about the degree of protection from re-identification can be provided. In the process, it is important to minimize the loss of information so that the resulting data remains useful. A typical use case is the creation of predictive models for knowledge discovery and decision support, e.g., to infer diagnoses or to predict outcomes of therapies. A variety of methods have been developed that can be used to build robust statistical classifiers from de-identified data. However, they have not been tuned for practical use, and they have not been implemented in mature software tools. To bridge this gap, we have extended ARX, an open-source anonymization tool for health data, with several new features. We have implemented a method for optimizing the suitability of de-identified data for building statistical classifiers and a method for assessing the performance of classifiers built from de-identified data. All methods are accessible via a comprehensive graphical user interface. We have used our methods to create logistic regression models from a patient discharge dataset for predicting the costs of hospital stays. The results show that our method enables the creation of privacy-preserving classifiers with optimal prediction accuracy.
Fabian Prasser
Johanna Eicher
Raffael Bild
Helmut Spengler
Klaus Kuhn