In order to train this model, two datasets of PPG signals are required: one consisting of clean PPG signals and the other containing noisy PPG signals. (In the rest of this paper, by noisy PPG signals we refer to PPG signals affected by motion artifacts.) Evaluating the model requires clean and noisy signals taken from the same patient over the same period of time. However, recording such data is not feasible: patients are either performing an activity, which yields a noisy signal, or are in a steady state, which yields a clean signal. For this reason, we simulate the noisy signal by adding data from an accelerometer to the clean signal. This is a common practice and has been used in related work (e.g., [34]) to address this issue. This way, the effectiveness of the model can be evaluated efficiently by comparing the clean signal with the output the model reconstructs from the derived noisy signal. In the following subsections, we explain the process of data collection for both the clean and noisy datasets.
2.2 Data Collection
We conducted laboratory-based experiments to collect accelerometer data for generating noisy PPG signals. Each experiment comprised 27 minutes of data. A total of 33 subjects participated, ranging in age from 20 to 62; 17 were male and 16 were female. In each experiment, subjects were asked to perform specific activities while accelerometer data were collected using an Empatica E4 [2] wristband worn on the dominant hand. The Empatica E4 is a research-grade wearable device that offers real-time physiological data acquisition, enabling researchers to conduct in-depth analysis and visualization. A recent study detected and discriminated acute psychological stress (APS) in the presence of concurrent physical activity (PA) using PPG and accelerometer data collected from the Empatica E4 wristband [35]. Figure 3 shows our experimental procedure. Note that the accelerometer signals are only required for generating/emulating noisy PPG signals; our proposed motion artifact removal method does not depend on having access to acceleration signals.
As shown in Figure 3, each experiment consists of six different activities: (1) Finger Tapping, (2) Waving, (3) Shaking Hands, (4) Running Arm Swing, (5) Fist Opening and Closing, and (6) 3D Arm Movement. Each activity lasts 4 minutes in total, consisting of two 2-minute parts with different movement intensities (low and high). Consecutive activities are separated by a 30-second rest (R) period. During the rest periods, participants were asked to stop the previous activity, put both arms on a table, and stay in a steady state. The accelerometer data collected during each of the activities were later used to model the motion artifact, as described in the next subsection.
2.3 Noisy PPG Signal Generation
To generate noisy PPG signals from clean PPG signals, we use the accelerometer data collected in our study. Clean PPG signals are taken directly from the BIDMC dataset. The accelerometer data are sampled at 32 Hz, so we down-sample the clean signals to 32 Hz to synchronize them with the collected accelerometer data.
The Empatica E4 has an onboard MEMS-type 3-axis accelerometer that measures the continuous gravitational force (g) applied along each of the three spatial dimensions (x, y, and z). The scale is limited to \(\pm 2\)g. Figure 4 shows an example of accelerometer data collected from the E4.
Along with the raw 3-dimensional acceleration data, Empatica also provides a moving average of the data. Every second, the following summation is calculated over the input (32 samples) received from the accelerometer sensor:
\[\text{Sum} = \sum_{t=1}^{32} \max\left(\left|\text{Acc}_x[t]-\text{Acc}_x[1]\right|,\ \left|\text{Acc}_y[t]-\text{Acc}_y[1]\right|,\ \left|\text{Acc}_z[t]-\text{Acc}_z[1]\right|\right),\]
where \(\text{Acc}_i[t]\) is the value of the accelerometer sensor (g) along the \(i\)-th dimension at time frame (sample) \(t\), and \(\text{Acc}_i[1]\) is the first value of the accelerometer sensor (g) along the \(i\)-th dimension in the current window. The \(\max(a,b,c)\) function simply returns the maximum value among \(a\), \(b\), and \(c\). It is worth mentioning that the values stored in the arrays \(\text{Acc}_x\), \(\text{Acc}_y\), and \(\text{Acc}_z\) change after each window is processed.
Afterwards, the moving average for the new window is calculated from this summation and the value of the moving average on the previous window. Figure 5 visualizes this moving average over the data.
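For concreteness, this per-window summation and moving-average filter can be sketched in Python. This is a minimal sketch, not the E4 firmware: the smoothing factor `alpha` is a hypothetical placeholder, since the text only states that each window's average combines the current summation with the previous window's average.

```python
import numpy as np

FS_ACC = 32  # E4 accelerometer sampling rate (Hz); one window = one second


def window_sum(acc_xyz: np.ndarray) -> float:
    """Per-second summation over a (32, 3) window of accelerometer samples.

    For each sample t, take the maximum deviation from the first sample of
    the window across the x, y, and z axes, then sum over all samples.
    """
    dev = np.abs(acc_xyz - acc_xyz[0])       # deviation from Acc_i[1], per axis
    return float(np.max(dev, axis=1).sum())  # max over axes, summed over t


def moving_average(acc: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Moving average over consecutive one-second windows.

    `alpha` is a hypothetical smoothing factor (not specified in the text).
    """
    n_windows = len(acc) // FS_ACC
    avg = np.zeros(n_windows)
    for w in range(n_windows):
        s = window_sum(acc[w * FS_ACC:(w + 1) * FS_ACC])
        avg[w] = s if w == 0 else alpha * s + (1 - alpha) * avg[w - 1]
    return avg
```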
This filtered output (Avg) is used directly as the motion-artifact model in our study. To simulate noisy PPG signals, we add this artifact model to 2-minute windows of the clean PPG signals collected from the BIDMC dataset. We use 40 of the 53 signals in BIDMC directly as the clean dataset for training. Among these 40 signals, 20 are selected and augmented with the accelerometer data to construct the noisy dataset for training. The remaining 13 BIDMC signals, combined with accelerometer data in the same way, form the clean and noisy datasets for testing. In the rest of this section, we describe each part of the model introduced in Figure 1.
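To illustrate how these datasets are assembled, the sketch below pairs a clean BIDMC signal with the accelerometer-derived artifact model. It is a minimal sketch under stated assumptions: the BIDMC signal is taken to be sampled at 125 Hz, and the artifact model is assumed to be already expressed at 32 Hz and trimmed to the target length.

```python
import numpy as np
from scipy.signal import resample


def make_noisy_ppg(clean_ppg_125hz: np.ndarray,
                   artifact_32hz: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Down-sample a clean PPG signal to 32 Hz and add the artifact model."""
    n_out = int(len(clean_ppg_125hz) * 32 / 125)     # target length at 32 Hz
    clean_32hz = resample(clean_ppg_125hz, n_out)    # synchronize sampling rates
    noisy_32hz = clean_32hz + artifact_32hz[:n_out]  # emulate motion artifact
    return clean_32hz, noisy_32hz
```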
2.4 Noise Detection
To perform noise detection, the raw signal, which is down-sampled to 32 Hz, is first normalized by a linear transformation that maps its values to the range \((0,1)\). This can be performed with a simple min-max normalization:
\[\text{Sig}_{\text{norm}} = \frac{\text{Sig}_{\text{raw}} - \min(\text{Sig}_{\text{raw}})}{\max(\text{Sig}_{\text{raw}}) - \min(\text{Sig}_{\text{raw}})},\]
where \(\text{Sig}_{\text{raw}}\) is the raw signal and \(\text{Sig}_{\text{norm}}\) is the normalized output. The normalized signal is then divided into equal windows of size 256, the same window size we use for noise removal. These windows are used as the input of the noise detection module to identify the noisy ones.
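A minimal sketch of this preprocessing step, assuming the down-sampled signal is available as a 1-D NumPy array:

```python
import numpy as np

WINDOW = 256  # window size shared with the noise-removal module


def normalize(sig_raw: np.ndarray) -> np.ndarray:
    """Min-max normalization mapping the raw signal into the range (0, 1)."""
    lo, hi = sig_raw.min(), sig_raw.max()
    return (sig_raw - lo) / (hi - lo)


def to_windows(sig_norm: np.ndarray) -> np.ndarray:
    """Split the normalized signal into non-overlapping windows of size 256."""
    n = len(sig_norm) // WINDOW
    return sig_norm[:n * WINDOW].reshape(n, WINDOW)
```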
A machine learning network similar to the one used in [42] can be employed as the noise detection system. To explain the network structure of the noise detection method (Table 1 and Figure 6): first, we use a 1D-convolutional layer with 70 initial random filters of size 10 to extract the basic features of the input data, converting the matrix size from \(256\times 1\) to \(247\times 70\). To extract more complex features from the data, another 1D-convolutional layer with the same filter size of 10 is applied. As the third layer, a pooling layer with a filter size of 3 is utilized. In this layer, a sliding window moves over the layer's input and, at each step, only the maximum value inside the window is retained. This layer converts a matrix of size \(238\times 70\) to \(79\times 70\). To extract additional complex features, another set of convolutional layers with a different filter size is used. This set is followed by two fully connected layers of sizes 32 and 16. Lastly, a dense layer of size 2 with a softmax activation produces the probability of each class, clean or noisy; the class with the higher probability is taken as the classification result. The accuracy of our proposed binary classification method is 99%, which means that the system can almost always distinguish a noisy signal from a clean one.
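One possible Keras realization of this classifier is sketched below. The first three layers reproduce the shapes stated above (\(256\times 1 \rightarrow 247\times 70 \rightarrow 238\times 70 \rightarrow 79\times 70\)); the filter size of the second convolutional set, the activations, and the training settings are not specified in the text, so the values used here are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_noise_detector() -> keras.Model:
    """CNN classifier (clean vs. noisy) over 256-sample PPG windows."""
    model = keras.Sequential([
        layers.Input(shape=(256, 1)),
        layers.Conv1D(70, 10, activation="relu"),  # basic features -> (247, 70)
        layers.Conv1D(70, 10, activation="relu"),  # complex features -> (238, 70)
        layers.MaxPooling1D(3),                    # max pooling -> (79, 70)
        layers.Conv1D(70, 5, activation="relu"),   # second conv set (size assumed)
        layers.MaxPooling1D(3),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),       # fully connected layers
        layers.Dense(16, activation="relu"),
        layers.Dense(2, activation="softmax"),     # P(clean), P(noisy)
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```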
2.5 Noise Removal
In this section, we explore the reconstruction of noisy PPG signals using deep generative models. Once a noisy window is detected, it is sent to the noise removal module for further processing. First, the windows are transformed into 2-dimensional images to exploit the power of existing image noise removal models; then a trained CycleGAN model is used to remove the noise induced by motion artifacts from these images. In the final step of noise removal, the image transformation is reversed to obtain the clean output.
The transformation needs to expose unexpected changes in the signal as visual features so that the CycleGAN model can distinguish and hence reconstruct the noisy parts. To extend the 1-dimensional noise on the signal into 2-dimensional visual noise on the image, we apply a transformation that maps a normalized signal window Sig into a 2D array Img storing a grayscale image, where each entry \(\text{Img}[i][j]\) is computed from the signal values at time frames \(i\) and \(j\) of the window. Each pixel, i.e., each entry of Img, then holds a value between 0 and 255, representing a grayscale image. An example of this transformation is provided in Figure 7 for both clean and noisy signals; as the figure shows, the effect of the noise is visually observable in these images.
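The sketch below uses one simple pairwise form that satisfies the stated constraints: pixel values fall in \([0, 255]\), noise at a single time frame spreads into a visible row/column pattern, and the diagonal entries recover the original samples (as used in the final reconstruction step). The specific formula is an illustrative assumption, not a confirmed detail of the method.

```python
import numpy as np


def signal_to_image(sig: np.ndarray) -> np.ndarray:
    """Map a normalized 256-sample window to a 256x256 grayscale image.

    Pixel (i, j) combines the signal values at time frames i and j (here,
    their mean -- an assumed form), so noise at one time frame shows up as
    a row/column pattern while the diagonal keeps the original samples.
    """
    s = sig[:, None]               # column vector, shape (256, 1)
    img = (s + s.T) / 2.0 * 255.0  # pairwise combination of time frames
    return img.astype(np.uint8)


def image_to_signal(img: np.ndarray) -> np.ndarray:
    """Invert the transformation using the diagonal entries."""
    return np.diagonal(img).astype(np.float64) / 255.0
```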
Autoencoders and CycleGAN are two of the most powerful approaches for image-to-image translation, and both have proven effective in the particular case of noise reduction. Autoencoders, however, require a pairwise correspondence between the images of the two datasets. In our case, clean and noisy signals are not captured simultaneously, and their quantities differ. CycleGAN, on the other hand, does not require the datasets to be paired. Moreover, the augmentation in CycleGAN makes it practically more suitable for datasets with fewer images. Hence, we use CycleGAN to remove motion artifacts from noisy PPG signals and reconstruct the clean signals.
CycleGAN is a Generative Adversarial Network designed for the general purpose of image-to-image translation. The CycleGAN architecture was first proposed by Zhu et al. [43].
The GAN architecture consists of two networks: a generator network and a discriminator network. The generator network starts from a latent space as input and attempts to generate new data from the domain. The discriminator network takes the generated data as input and predicts whether it comes from the dataset (real) or was produced by the generator (fake). The generator is updated to generate more realistic data to better fool the discriminator, and the discriminator is updated to better detect the data generated by the generator network.
CycleGAN is an extension of the GAN architecture in which two generator networks and two discriminator networks are trained simultaneously. One generator network takes data from the first domain as input and generates data for the second domain; the other generator takes data from the second domain and generates data for the first domain. The two discriminator networks determine how plausible the generated data are, and the generator models are updated accordingly. This extension by itself cannot guarantee that the learned function translates an individual input into the desired output. Therefore, CycleGAN adds cycle consistency as a further extension to the model: the output of the first generator can be used as input to the second generator. Cycle consistency is encouraged by adding an additional loss that measures the difference between the output of the second generator and the original data (and vice versa). This guides the data generation process toward data translation.
In our CycleGAN architecture, we apply adversarial losses [16] to both mapping functions (\(G: X\rightarrow Y\) and \(F: Y\rightarrow X\)). The objective for the mapping function \(G\) as a generator and its discriminator \(D_Y\) is expressed as:
\[\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y\sim p_{data}(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x\sim p_{data}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big],\]
where the function \(G\) takes an input from domain \(X\) (e.g., noisy PPG signals) and attempts to generate new data that look similar to data from domain \(Y\) (e.g., clean PPG signals). In the meantime, \(D_Y\) aims to determine whether its input comes from the translated samples \(G(x)\) (e.g., reconstructed PPG signals) or from real samples of domain \(Y\). A similar adversarial loss is defined for the mapping function \(F:Y\rightarrow X\) as \(\mathcal{L}_{GAN}(F, D_X, Y, X)\).
As discussed before, adversarial losses alone cannot guarantee that the learned function maps an individual input from domain \(X\) to the desired output in domain \(Y\). In [43], the authors argue that to reduce the space of possible mapping functions even further, the learned mapping functions (\(G\) and \(F\)) need to be cycle-consistent. This means that the translation cycle must be able to translate the input from domain \(X\) back to the original image, i.e., \(x\rightarrow G(x)\rightarrow F(G(x))\approx x\). This is called forward cycle consistency. Similarly, backward cycle consistency is defined as \(y\rightarrow F(y)\rightarrow G(F(y))\approx y\). This behavior is captured in our objective function as:
\[\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(G(x)) - x\rVert_1\big] + \mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(F(y)) - y\rVert_1\big].\]
Therefore, the final objective of the CycleGAN architecture is defined as:
\[\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\,\mathcal{L}_{cyc}(G, F),\]
where \(\lambda\) controls the relative importance of the two objectives.
In Equation (7), the generators \(G\) and \(F\) aim to minimize the objective while the adversaries \(D_X\) and \(D_Y\) attempt to maximize it. Therefore, our model aims to solve:
\[G^{*}, F^{*} = \arg\min_{G,F}\,\max_{D_X, D_Y}\ \mathcal{L}(G, F, D_X, D_Y).\]
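Putting the losses together, the full objective can be assembled as in the following minimal sketch, assuming the discriminator outputs are probabilities and taking \(\lambda = 10\) as a placeholder value (the text does not fix \(\lambda\)):

```python
import tensorflow as tf

EPS = 1e-8  # numerical stability inside the logarithms


def gan_loss(d_real: tf.Tensor, d_fake: tf.Tensor) -> tf.Tensor:
    """Adversarial loss: E[log D(real)] + E[log(1 - D(fake))]."""
    return (tf.reduce_mean(tf.math.log(d_real + EPS)) +
            tf.reduce_mean(tf.math.log(1.0 - d_fake + EPS)))


def cycle_loss(x, x_rec, y, y_rec) -> tf.Tensor:
    """Cycle-consistency loss: L1 distance between inputs and reconstructions."""
    return (tf.reduce_mean(tf.abs(x_rec - x)) +
            tf.reduce_mean(tf.abs(y_rec - y)))


def full_objective(dy_real, dy_fake, dx_real, dx_fake,
                   x, x_rec, y, y_rec, lam: float = 10.0) -> tf.Tensor:
    """L(G, F, D_X, D_Y): generators minimize this objective while the
    discriminators maximize their adversarial terms. `lam` stands for lambda."""
    return (gan_loss(dy_real, dy_fake) +
            gan_loss(dx_real, dx_fake) +
            lam * cycle_loss(x, x_rec, y, y_rec))
```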
The architecture of the generative networks is adopted from Johnson et al. [21]. This network contains four convolutions, several residual blocks [19], and two fractionally-strided convolutions with stride 0.5. For the discriminator networks, we use \(70\times 70\) PatchGANs [20, 23, 24].
After the CycleGAN is applied to the transformed image, the diagonal entries of the output image are used to retrieve the reconstructed signal.