WP1: Uncertainty Quantification for ML models for PPG signals
The aim of this work package is to develop methods for quantifying the uncertainty in supervised machine learning and deep learning models, enabling at least 3 classification and 3 regression tasks involving PPG data to be performed while considering the effects of both aleatoric (data) and epistemic (model) uncertainty on model predictions. Conventionally, the total predictive uncertainty is divided into aleatoric and epistemic uncertainty: aleatoric uncertainty represents the noise inherent in the observed data, whereas epistemic uncertainty accounts for the uncertainty in the model itself, i.e., uncertainty which can usually be explained away given more data. A common framework will be developed to evaluate the performance of trained models and to quantify and validate the uncertainties obtained for those models, thus permitting the accuracy and uncertainty of models to be compared, and models with high accuracy and low uncertainty to be identified.
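As an illustration of this decomposition (a minimal sketch in Python; the function and variable names are illustrative and not part of the project plan), the predictive entropy of a classifier ensemble splits into an aleatoric term (the average entropy of the individual members) and an epistemic term (the members' disagreement, i.e., the mutual information between prediction and model):

```python
import numpy as np

def decompose_uncertainty(probs):
    """Decompose ensemble predictive uncertainty for a single input.

    probs: array of shape (n_members, n_classes) holding each ensemble
    member's softmax output. Returns (total, aleatoric, epistemic) in nats.
    """
    eps = 1e-12  # avoid log(0)
    mean_p = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged predictive distribution.
    total = -np.sum(mean_p * np.log(mean_p + eps))
    # Aleatoric part: average entropy of the individual members.
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Epistemic part: what remains, i.e., member disagreement.
    return total, aleatoric, total - aleatoric

# Members that agree carry no epistemic uncertainty; disagreement adds it.
agree = np.array([[0.9, 0.1], [0.9, 0.1]])
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
```

More data reduces member disagreement and hence the epistemic term, while the aleatoric term, reflecting data noise, persists.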
This work package covers the main technical contribution of the project through the following tasks:
Task 1.1:
The training of ML models for different classification and regression problems using different data representations. The models will be chosen to cover the breadth of models used for analysing PPG data in practical applications. Depending on the input representation, the algorithms considered will cover deep learning (DL) models, in particular models operating on high-dimensional features or raw data, and classical machine learning (ML) algorithms. These models will be trained using a dataset for one prototypical classification problem and one prototypical regression problem, which will be selected in A2.1.3 at the start of the project to ensure that model development can start as early as possible. For this reason, the prototypical datasets do not necessarily have to match the prediction problems (and datasets) selected for the final benchmarking. To simplify the description below, we always assume that each procedure is applied to both classification and regression tasks, i.e., training one ML model refers to training one model for each kind of task.
Task 1.2:
The implementation and adaptation of different methods for uncertainty quantification. The methods to be considered include: deep ensembles (A1.2.1), a model-agnostic uncertainty quantification method that trains multiple prediction models with different random seeds and obtains uncertainty information from an ensemble model combining them; Monte-Carlo dropout (A1.2.2), a very efficient uncertainty estimation method that performs approximate Bayesian inference by randomly switching off neurons; calibration methods (A1.2.3), typically applied in a post-hoc fashion so that model predictions coincide with empirical occurrence probabilities; conformal prediction (A1.2.4), a distribution-free uncertainty estimation framework that yields uncertainty estimates with statistically rigorous coverage guarantees; and maximum a posteriori (MAP) inference (A1.2.5), which enables estimates of the (heteroscedastic) aleatoric uncertainty to be obtained. The third objective of the project is the validation of the uncertainties; the methodology for this evaluation will be implemented in A1.2.6. Finally, a first quantitative evaluation for a limited set of models and for the prototypical tasks from A1.1.1 will be carried out in A1.2.7, which, together with the performance evaluation in A1.1.7, forms the basis for the full evaluation in Task 1.3.
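To illustrate the coverage guarantee behind conformal prediction (A1.2.4), the following minimal split conformal sketch for regression (the function name and interface are hypothetical, not project deliverables) widens point predictions by a finite-sample-corrected quantile of held-out calibration residuals:

```python
import numpy as np

def split_conformal_interval(resid_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction intervals for regression.

    resid_cal: absolute residuals |y - y_hat| on a held-out calibration set.
    y_pred_test: point predictions for the test inputs.
    Under exchangeability of calibration and test data, the returned
    intervals have marginal coverage of at least 1 - alpha.
    """
    n = len(resid_cal)
    # Finite-sample-corrected quantile level of the calibration residuals.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(resid_cal, level, method="higher")
    return y_pred_test - q, y_pred_test + q
```

The guarantee is distribution-free: it requires only that calibration and test samples are exchangeable, not any particular noise model.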
Task 1.3:
The quantitative evaluation of the methods from Task 1.2, including a standardised comparison of the different models, each paired with a chosen uncertainty quantification method, according to different quality criteria, most notably quantitative performance and the quality of the uncertainty estimate. In A1.3.1, we repeat the analyses from A1.1.7 and A1.2.7, this time for all models developed in Task 1.1, and use them for a first comparative assessment of accuracy vs. uncertainty on in-distribution data, i.e., for test and training data drawn from the same distribution. In A1.3.2, the results will be analysed for different demographic subgroups, e.g., sex, and compared with results obtained using single-sex training data; the classification of PPG signals by biological sex will also be considered. The study of out-of-distribution (OOD) data, arising either from a test set drawn from a different distribution or from input noise applied to the original data, is the focus of A1.3.3. In A1.3.4, we will study the dependence of the results on various parameters such as the number of samples in the dataset, the sampling rate, PPG filtering cut-off frequencies, signal quality and other PPG-specific dataset characteristics. In A1.3.5, the respective effects of aleatoric and epistemic uncertainty and their applicability to different analysis tasks will be investigated. The classification of PPG signals by skin tone will be studied in A1.3.6 to determine whether machine learning can distinguish between signals from different skin tones. Finally, the results of all analyses will be compiled into a comprehensive report, which represents the most prominent project deliverable from WP1.
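One standard metric for the quality of an uncertainty estimate in classification is the expected calibration error (ECE), which compares predicted confidences with empirical accuracies in bins. The following sketch (with a function name of our own choosing, given purely for illustration) shows the computation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error for top-class confidences.

    confidences: predicted probability of the predicted class, in (0, 1].
    correct: 1 if the corresponding prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its sample share.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated model (e.g., 75% accuracy among predictions made with 75% confidence) attains an ECE of zero; overconfident models score higher.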
The technical developments follow a modular structure. In particular, models built in Task 1.1 can be combined with any model-agnostic and suitable model-specific uncertainty quantification methods developed in Task 1.2. To ensure comparability, all combinations of models and uncertainty quantification methods are evaluated using a common evaluation framework, implemented in A1.1.6 for quantitative accuracy and in A1.2.6 for the quantitative assessment of the quality of the uncertainty estimates; this evaluation is carried out in Task 1.3.
WP2: Creation of benchmark datasets involving PPG signals and community uptake
The aim of this work package is to generate 5 benchmark problems containing datasets of clinical interest on which the performance and uncertainty metrics from A1.1.6 and A1.2.6 can be evaluated, and to promote the uptake and implementation of the methods for uncertainty quantification developed during the project within the scientific, medical device, digital and healthcare communities. The chosen problems will include both classification and regression problems and will be selected according to their clinical interest and the availability of public open-source data. Appropriate metadata will be included with the datasets. The datasets will contain some variety, i.e., they may contain real, synthetic, or phantom data, data of varying quality, signals from different sites on the body, and different demographics. Examples of benchmark problems for which publicly available datasets exist include determining (i) systolic and diastolic blood pressure, (ii) blood glucose, and (iii) vascular age for regression, and detecting (iv) atrial fibrillation and other heart rhythm conditions, (v) diabetes, and (vi) aneurysms, arterial stiffening or stenoses for classification. Other problems may, however, also be considered. In addition, a good practice guide and an accompanying code repository for the independent review of machine learning models will be developed.
This work package covers the generation of the benchmark problems through the following tasks:
Task 2.1:
This task starts with the generation of at least 5 measurement problems and their corresponding datasets, using real and/or synthetic photoplethysmography data, that can be used to benchmark the accuracy and uncertainty of supervised machine learning and deep learning models.
Task 2.2:
This task deals with the development of a good practice guide, which aims to explain how to apply the machine learning models and uncertainty quantification methods developed in WP1 to the dedicated benchmark problems.
Task 2.3:
A code framework will be developed that provides the implemented machine learning models and uncertainty quantification methods, together with their numerical application to the benchmark problems of Task 2.1, supporting industry in attaining certification and regulatory approval in line with the good practice guide from Task 2.2.