AI algorithms and intelligent systems hold great promise in the health sector, such as decreasing the cost of supporting an aging population, discovering new therapeutic avenues for complex diseases such as cancer, and improving patient management. To support future advances in the Canadian healthcare system with data-driven intelligent algorithms, barriers around data access and privacy must be addressed, and many scientists are turning toward generative models and synthetic datasets for innovative solutions.
On November 25, CIFAR, IVADO and Mila will host a symposium featuring Canadian and international experts to address the opportunities and challenges around the use and deployment of synthetic data in healthcare.
A private workshop will be held on Nov. 26, by invitation only, with the goal of facilitating discussions and collaborations between academic researchers and private sector or hospital partners. If you are interested in attending, please reach out to our Events Team.
Raymond Ng, University of British Columbia
Khaled El Emam, University of Ottawa; Replica Analytics
Flora Jay, Université Paris Saclay/CNRS
Aurélien Decelle, Universidad Complutense de Madrid
Blake Richards, Mila, McGill University
Guillaume Lajoie, Mila, Université de Montréal
11:00am – 11:05am
Opening Remarks & Land Acknowledgement
Elissa Strome, CIFAR
11:05am – 11:40am
Keynote: Synthetic data generation for privacy-preserving data releases in health care
Raymond Ng, Professor and Director, Data Science Institute, University of British Columbia
Rob Bergen and Jean-Francois Rajotte, Data Science Institute, University of British Columbia
As part of a research partnership between the Provincial Health Services Authority in BC and the UBC Data Science Institute, we have a program that explores how to provide health data releases in privacy-preserving ways. In this talk, we will give an overview of this program. We will present a recently developed framework for generating 3D PET images. We will also show how to conduct synthetic data generation in a federated learning setting, in which collaborators do not need to share local training data. Finally, we will discuss the importance of measuring the risk of membership inference attacks on synthetic data.
11:40am – 12:00pm
Practical Experiences with the Development and Deployment of Synthetic Data Generation Technologies
Khaled El Emam, Professor, University of Ottawa; Co-Founder and CEO, Replica Analytics
We have been developing synthetic data generation (SDG) tools to enable the sharing of health data and to perform simulations. At the same time, we have been deploying these tools in practice within public and private sector organizations globally. The transition of SDG into practice requires solving basic problems such as identifying and validating meaningful utility metrics for training, tuning, and communicating about SDG models. Real health data is longitudinal, with many complex patterns that need to be modeled, and SDG solutions need to account for these. One of the first questions that comes up in the context of SDG is how privacy risks can be managed. This presentation will cover some of the practical issues with SDG and how we have addressed them.
12:00pm – 12:15pm
12:15pm – 12:35pm
Improving nervous system interfacing with generative adversarial data synthesis
Blake Richards, Core Member/Assistant Professor, Mila/McGill
Guillaume Lajoie, Assistant Professor, Mila & UdeM
Simulated datasets of neural recordings are a crucial tool in neural engineering for testing the ability of decoding algorithms to recover a known ground truth. In this work, we introduce PNS-GAN, a generative adversarial network capable of producing realistic nerve recordings conditioned on physiological biomarkers. PNS-GAN operates in the wavelet domain to preserve both the timing and frequency of neural events with high resolution. PNS-GAN generates sequences of scalograms from noise using a recurrent neural network and 2D transposed convolution layers. PNS-GAN discriminates over stacks of scalograms with a network of 3D convolution layers. We find that our generated signal reproduces a number of characteristics of the real signal, including similarity in a canonical time-series feature space, and contains physiologically related neural events, including respiration modulation and similar distributions of afferent and efferent signalling.
12:35pm – 12:55pm
Creating artificial human genomes using generative neural networks
Flora Jay, CNRS Researcher (CR), Université Paris-Saclay, LISN
Aurélien Decelle, Researcher, Universidad Complutense de Madrid
Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the restricted access to many genetic databases due to concerns about violations of individual privacy, although these databases would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrate that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with little to no privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we show that imputation quality for low-frequency alleles can be improved by augmenting reference panels with AGs, and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and providing features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.
Joint work with: Burak Yelmen, Aurélien Decelle, Linda Ongaro, Davide Marnetto, Corentin Tallec, Francesco Montinaro, Cyril Furtlehner, Luca Pagani, Flora Jay
12:55pm – 1:15pm
Synthetic Data: Opportunities and Challenges for Clinical Research
David Buckeridge, Professor / Chief Digital Health Officer, McGill University / MUHC
Machine learning methods generally require large amounts of detailed data to train models. However, it is difficult for machine learning researchers to access and manage clinical data due to privacy concerns and other barriers. In this context, synthetic data presents opportunities for improving access to realistic data, possibly increasing the pace and scale of machine learning research using clinical data. The use of synthetic data is not without challenges, however. For example, available synthesis methods are not well suited to all data types and may not provide guarantees of privacy preservation. This presentation will review these and other opportunities and challenges to identify promising applications for the use of synthetic clinical data.
1:15pm – 1:30pm
Wrap up & Closing Remarks