
December 12, 2019, 2:00 PM: Carlos SAEZ (Universitat Politècnica de València)

December 12, 2019, 2:00 PM, Room 3/124, Bat 5.

Title: Probabilistic methods for multi-source and temporal biomedical data quality and variability assessment: a review and case studies
Abstract: Biomedical data repositories are growing in both sample size and number of variables. Two significant reasons behind this are the widespread adoption of data-sharing initiatives and technological infrastructures, and the continuous and systematic population of those repositories over long periods of time. However, these two situations can also introduce potential confounding factors in the data which may hinder their reuse, for example in population research or machine learning. Specifically, differences in protocols or populations, or even unexpected biases or errors caused by either systems or humans, can lead to undesired heterogeneity in data among their sources or over time. This multi-source and temporal variability can be reflected in the statistical distributions of the data, representing a Data Quality (DQ) issue which must be addressed for reliable data reuse.
In this talk, we will first review proposed methods for multi-source and temporal DQ assessment. These include a set of metrics to measure and visualize the data concordance among multiple sources (e.g., the Global Probabilistic Deviation and the Source Probabilistic Outlyingness), and an exploratory methodology to describe the variability of data over time (e.g., Information Geometric Temporal (IGT) plots). In the second part, we will describe the application of these methods to a selection of case studies, including the Public Health Mortality Registry of the Region of Valencia, Spain, the US National Hospital Discharge Survey, and a pilot project by the Spanish Ministry of Health, Social Services and Equality aimed at building a standardized, DQ-assessed, integrated maternal and child care data repository for research and monitoring of best practices. We will describe the generic usage of these methods using the open-source R package EHRtemporalVariability developed by the lab.
Speaker: Carlos Saez, PhD in Technologies for Health and Well-Being (2016), MSc in Artificial Intelligence, Pattern Recognition and Digital Imaging (2009), is a postdoctoral researcher at the Biomedical Data Science Lab of the ITACA Institute of the Universitat Politècnica de València (UPV), Spain. His current research addresses data quality (DQ) and variability assessment of biomedical data, focusing especially on the automated characterization of big data variability over time and among multiple sources (e.g., hospitals, devices, professionals). Two of his first-author publications were selected among the best published worldwide by the International Medical Informatics Association in the fields of Secondary Use of Patient Data (2016) and Health Information Systems (2013). He is a visiting research fellow at the Department of Biomedical Informatics, Harvard Medical School, US, and the Center for Research in Health Technologies and Information Sciences (CINTESIS), University of Porto, Portugal.

Permanent link to this article: https://team.inria.fr/graphik/news-3/december-12-2019-200-pm-carlos-saez-universitat-politecnica-de-valencia/