How to generate realistic, non-personal synthetic health data?

September 4 - 15:00 - 16:00

Centre de recherche - Paris

Amphithéâtre Hélène Martel-Massignac (BDD)

11 rue Pierre et Marie Curie - 75005 Paris

Description

Under the GDPR, a person's health data are subject to strict governance and access rules. To share such data with a wider audience for educational, research, and development purposes, it has been suggested to generate synthetic data, i.e., anonymous data (which by definition fall outside the scope of the GDPR) that preserve the properties of the real dataset.

Synthetic data are typically obtained by sampling from a noisy model of the data, which creates a clear privacy/utility tradeoff: adding a lot of noise yields high privacy but low utility, while adding no noise yields low privacy but high utility. There is therefore a clear need for ways of measuring the privacy of synthetic data generation. Differential privacy is considered the state-of-the-art privacy property, but it does not directly measure an individual's privacy risk. As an alternative, Bayesian privacy has been suggested, which measures an individual's posterior disclosure risk given the synthetic data and the generating mechanism. However, current implementations scale poorly with respect to both construction and computation time. In this talk, the notions of differential and Bayesian privacy will be defined, followed by a discussion of the open problems in the large-scale adoption of Bayesian privacy measures. This work is part of the project "Synthetic Health Data: Ethical Development and Deployment via Deep Learning Approaches (SE3D)", supported by the Novo Nordisk Foundation's Data Science Collaborative Research Programme.
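For readers unfamiliar with the two notions, a standard formulation of differential privacy is given below, together with one common way of phrasing Bayesian privacy as a posterior disclosure risk; the precise definitions used in the talk may differ.

```latex
% A mechanism M is \varepsilon-differentially private if, for every pair of
% datasets D, D' differing in a single record and every output set S,
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S].

% Bayesian privacy instead considers the attacker's posterior belief: given
% released synthetic data Z and knowledge of the generating mechanism M, the
% disclosure risk for a target record x is the posterior probability
\Pr[x \in D \mid Z, M],
% which is compared against the prior \Pr[x \in D]; the release is considered
% private when the posterior stays close to the prior.
```

To make the "sampling from a noisy model" idea concrete, here is a minimal Python sketch that uses a Laplace-noised histogram as the model. The function name, parameters, and simulated data are illustrative assumptions, not material from the talk; lowering epsilon adds more noise, which is exactly the privacy/utility tradeoff described above.

```python
import numpy as np

def dp_synthetic_histogram(data, bins, epsilon, n_synthetic, seed=None):
    """Draw synthetic samples from an epsilon-DP Laplace-noised histogram.

    Each record falls in exactly one bin, so the histogram has L1
    sensitivity 1 and Laplace noise with scale 1/epsilon suffices.
    """
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(data, bins=bins)
    # Noisy model: counts + Laplace(1/epsilon), clipped and renormalized
    # so the result is a valid sampling distribution.
    noisy = np.clip(counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape), 0, None)
    probs = noisy / noisy.sum()
    # Sample a bin for each synthetic record, then a uniform value inside it.
    idx = rng.choice(len(probs), size=n_synthetic, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])

# Hypothetical example: simulated systolic blood pressure readings.
# A smaller epsilon means more noise: higher privacy, lower utility.
real = np.random.default_rng(0).normal(loc=120, scale=15, size=10_000)
synthetic = dp_synthetic_histogram(real, bins=50, epsilon=0.5, n_synthetic=10_000)
```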

Speakers

Martin Bøgsted

Aalborg University and Aalborg University Hospital (Denmark)