1. Introduction
In the last two decades, the interest in the variation of the timing between beats (RR-intervals) of the cardiac cycle, called heart rate variability (HRV), has widely increased in the psycho-physiological research field. Assessment of RR-intervals variability is possible through time and frequency domain analyses that provide parameters able to quantify the amount of fluctuations occurring between consecutive beats, giving therefore an indirect index of autonomic regulation. Actually, the parameters extracted from HRV analysis are useful to provide insight about sympathetic-parasympathetic balance of cardiac vagal tone that was found to be an indicator of cognitive, emotional, social and health status [
1].
Thanks to the technological advancements of recent decades, it is now possible to continuously record heart activity during peoples’ life via wrist-worn wearable devices equipped with heart rate sensors. This innovation might have a great impact on the medical field because of the low cost of the devices and the possibility to obtain continuous passive measurements performed in an ecological setting, gaining an overview of the users’ health status by assessing HRV features during their daily life [
2]. These wrist-worn wearable devices, however, produce several inconsistent RR-intervals produced not only by ectopic beats (e.g., atrial fibrillation and premature heart beat), but mainly by motion and mechanical artifacts induced by external stimuli. The number of abnormal RR-intervals increases from 1%—when heart beats are recorded with gold standard technology (i.e., electrocardiography)—[
3] to more than 10%—when they are recorded with wrist-worn wearable devices. However, standard methods for calculating HRV features from the time-series of RR-intervals require accurate beat detection. Hence, handling the missing values became a fundamental aspect to correctly evaluate users’ physiological response. As a matter of fact, these missing values affect the HRV analysis producing misleading results [
4]. In previous studies, the inconsistent RR-interval data were handled by reconstructing the missing values using nearest-neighbour, linear, cubic spline and piecewise cubic Hermite interpolation methods [
4,
5]. However, these methods can also introduce changes in the reconstructed timeseries that could corrupt the signal spectrum [
6], thus reducing the ability to estimate both time or frequency domains HRV features.
In this paper, we focus on healthy subjects with normal heart activity, and investigate the effects of interpolation on time (i.e., the timestamps when the heartbeats happen) and duration (i.e., the duration of the heartbeats) with an increasing amount of missing values (from 0% to 70%) in order to assess which interpolation strategy yields better results when estimating HRV features. In particular, in this paper we show that quadratic interpolation on time is the best approach to reconstruct the missing RR-intervals. Anyway, the main finding of this study is that the interpolation on time produce better HRV feature estimation that the interpolation on Duration suggested by all the previous studies.
1.1. Paper Contribution
To the best of our knowledge, this work is one of the first studies investigating the effect of high percentage of missing values (i.e., 30%, 50% and 70%) on HRV analysis. In previous studies, the inconsistency of RR-intervals was due to a small number of ectopic beats, while wrist-worn wearable devices introduce motion and mechanical artifacts that produce a huge quantity of abnormal heart beats.
Moreover, to the best of our knowledge, this is the first study to analyse the effect on HRV features of interpolation on time versus interpolation on duration. We show the difference among interpolation methods (i.e., no-interpolation, nearest neighbor, linear, quadratic and cubic spline) on both time and duration timeseries in order to detect which interpolation method yields lower error in HRV features estimations. This analysis permits to provide insight about how the interpolation methods work in quantifying the noise introduced into the timeseries.
We conclude by showing that interpolation on time is the best choice for preprocessing RR timeseries with missing values, contradicting the approach traditionally followed, based on durations timeseries.
1.2. Related Work
During the day, approximately 1% of beats are to be expected to be ectopic [
3] when they are recorded by using gold standard instrument (i.e., Electrocardiography). An ectopic beat is a disturbance of the cardiac rhythm that induces premature ventricular or atrial contraction. The physiological artifact producing inconsistent beat seriously affects the HRV spectrum, and could result in erroneous results during HRV analysis by introducing non-existing frequencies into the spectrum [
7]. In addition to physiological artifact, motion and mechanical artifacts induced by external stimuli introduce a large amount of inconsistent beats when the data are recorded by using wrist-worn wearable device [
4,
5,
6]. This work is one of the firsts studies that investigate the effect of huge quantities of inconsistent beats that are not only derived from ectopic beats. Since missing data are common in the RR-interval timeseries derived from wrist-worn wearable device, they could complicate the analysis of HRV features making it sometimes impossible. To make reliable HRV analysis, previous studies suggested several preprocessing methods for RR-intervals timeseries (e.g., deletion, interpolations and filtering). However, these preprocessing methods have their own distinct effect on HRV analysis yielding different results [
7].
The simplest way of handling the inconsistent RR-intervals provided in literature is to delete them [
8]. In this approach, the abnormal RR-intervals are removed and the normal RR-intervals list are merged together. A huge issue of the deletion approach is that it reduces the overall length of the HRV signal. This may significantly influence HRV spectrum [
8]. Other interpolation methods maintain the original number of samples, but, by manipulating the duration of RR-intervals, they also change the overall duration by some amount. There are several interpolation approaches useful for handling inconsistent RR-intervals, i.e., zero degree, linear and cubic spline [
9]. Zero degree replaces the inconsistent RR-intervals with the mean of the closest normal values. Differently, linear interpolation fits a straight line over the inconsistent RR-intervals to obtain normal values. Finally, the most popular interpolation approach is the spline of order three (i.e., cubic spline). It fits a third degree polynomial smooth curve through a number of data points to obtain new values. This latter approach is recommended when there is only small number of inconsistent RR-intervals [
9].
Finally, it was found that the interpolation introduces low frequency components (LF) and reduces high-frequency components (HF) power [
6]. This aspect affects frequency domain HRV features [
5], while little effect was found in time domain HRV features [
4].
We were not able to find any previous work studying the effect of interpolation missing values on the duration versus time, and the propagation of error to HRV features.
2. Materials and Methods
2.1. Dataset
In this paper, we used
nsr2db (Normal Sinus Rhythm RR Interval Database) PhysioNet dataset [
10]. This dataset contains beat annotations of 54 normal sinus rhythm subjects (30 men: 28–76 years; 24 women: 58–73 years) extracted from 23 h long electrocardiogram (ECG) recordings, digitized at 128 samples per second, and beat annotations obtained by automated analysis with manual review and correction.
In order to compute HRV features, the 23 h time series of ECG recording of each user were split into 5 min windows. Moreover, to investigate the effect of missing values on HRV analysis, artificial missing RR-intervals (i.e., 30%, 50% and 70% of missing values) were inserted into the 5 min windows.
The missing values were created in accordance with a burst Gilbert model that simulates burst-error with a two-state Markov chain (i.e., good as 0 and bed as 1) [
11]. We define
P as the probability of transition form state 0 to the state 1 and
p the probability of transition from state 1 to 0. Moreover,
Q and
q give the probabilities of remaining in the same states 0 or 1 (see
Figure 1). Using these parameters, it is possible to represent average bit-error rate
as showed in Equation (
1) and the average burst length (
) is set at 10.
Given these equation, we define
P,
p,
Q and
q as showed in Equations (
2)–(
5), respectively.
where the
is set in accordance with the missing values percentage that we want to add in the time series (e.g., if we want 20% of missing values we set
as 0.3). The missing values were introduced in the time series when the state of the two-state Markov chain is equal to 1. Examples of 30%, 50% and 70% of missing values created by Gilbert model are provided in
Figure 2.
2.2. Missing Values Interpolation
The missing values were then handled with six different interpolation methods:
No interpolation: this approach does not create interpolated values of missing RR-intervals. Differently to the Deletion method that remove missing values merging non-consecutive beats that induce in missing interpretation of HRV features, the no-interpolation method maintains the missing values into the RR-intervals time series.
Nearest neighbor: the nearest neighbor or proximate interpolation is the easiest interpolation method [
12]. This interpolation assigns the value of the closest known (existing) neighbor to the missing- value as shows in Equation (
6).
where
a and
b are the indexes of
and
. Interpolated data by this method are discontinuous and it often yields the worst results [
13]
Linear: this method fits a straight line passing through points
and
[
14]. Interpolated data by the linear model are bound between
and
as showed in Equation (
7).
Gaunck et al. [
14] demonstrated that this method is efficient, and most of the time it is better than non-linear interpolations for predicting missing values in environmental phenomena with constant rates. In addition, they also found that in average this interpolation model underestimated the real values but it strongly depends on the distribution of the data.
Quadratic: differently from the linear interpolation model, the quadratic function needs three points of interest to interpolate missing values in a time series as showed in Equation (
8).
Compared to the linear model, quadratic interpolation is found to be in general more accurate [
13].
Spline cubic: fitting datapoints using polynomials of degree higher than one leads to problems of oscillation outside the fitted points, known as Runge’s phenomenon [
15]. This problem can be avoided by using a spline, a function defined piecewise by polynomials, using datapoints as control points instead of forcing the fitted function to pass through the data points. Cubic spline is a spline composed of piecewise third-order polynomials. By using third degree polynomials is possible to ensure that the resulting curve is smooth [
15], avoiding the problem of the straight polynomial interpolation that tends to induce distortions on the edges of the polynomials, given by the fact that, in general, the first and second derivative of the function defined by piecewise polynomials will not be continuous at the edges of polynomials. With cubic spline, it is possible to force the first and second derivatives of consecutive polynomials to be equal, ensuring smoothness of the resulting curve.
We applied each of the interpolation methods listed above to heartbeats expressed as a sequence of durations and as a sequence timestamps, then analyzed the error in HRV features estimations, in order to identify the best approach.
The on-duration approach is the one mostly used in literature to handle missing values. The data used as input to the interpolation methods was the sequence of durations of the heartbeats (the RR-intervals), obtained by subtracting the timestamp of each heartbeat from the timestamp of the subsequent heartbeat in the sequence of heartbeats.
Differently, we propose the the on-time approach whereby interpolation methods are applied to the sequence of timestamps of the heartbeats, postponing the differentiation preprocessing step that transforms timestamps into durations to after the interpolation step.
As shown in
Figure 3 the difference between the on-time and on-duration approaches is the order of the processing steps: in the on-duration approach the timestamps are converted to durations as the first processing step; in the on-time approach this step is performed after interpolation is performed.
To better illustrate the differences between interpolation on time and duration, we simulated 100 heartbeats and then we randomly generated 10% of missing RR-intervals in this artificial timeseries by using Gilbert burst approach. The length of the RR-intervals timeseries changes when we interpolate the missing values on duration, while it remains the same when we interpolate on time (
Table 1). This result suggests that the interpolation on duration moves beats away from their original position in time, introducing changes to the spectrum, while interpolation on time preserves the position on time of retained heartbeats. In particular,
Table 1 shows that the low RR-intervals error (i.e., average difference between heartbeats duration) is obtained with linear interpolation on time. The nearest interpolation on time was not performed because interpolating with this approach is useless, as the interpolated values introduced into the timeseries would have the same time as the closest beat, creating physiologically impossible data.
Figure 4 provide more detailed analyses of the difference between linear interpolation on both duration and time. The cumulative error when the missing values are interpolated on duration increases as the time series goes by because it creates RR-intervals in accordance with the closest interval values (i.e., the higher is the number of missing values, the higher is the cumulative error) depending on the interpolation type used (e.g., linear, quadratic and cubic spline). Differently, time interpolation did not introduce change in time series length due to the fact that this approach estimates intermediate values between the time when two observed beats happen in accordance with interpolation type.
2.3. Feature Engineering
To obtain HRV features, we analyzed real (i.e., without missing values values), non-interpolated, and interpolated (i.e., with different percentage of artificial missing values values) 5 min ECG time series. We analyze time domain HRV features, frequency, and non-linear domains. Time domain analysis usually contains various statistical variables of the duration time series. The frequency domain analysis investigates the power spectrum of RR-intervals time series in order to assess the cardiac autonomic balance (i.e., sympathetic and parasympathetic nervous systems activity). Additionally, non-linear HRV features try to capture the non-periodic behaviour of the HRV and the complexity that exists inside the RR-interval dynamics. The variables that we incude in our analysis, in both time and frequency domain, are defined as:
Time domain:
- -
HR mean: mean values of heart rate (HR) computed as showed in Equation (
9).
where N is the number of beats and R is the time when the beats happened.
- -
RMSSD: root mean square of the successive RR-intervals differences (Equation (
10)) represents the strength of the autonomic nervous system (specifically the parasympathetic branch) at a given time.
where N is the number of beats and R is the time when the beats happened.
- -
SDNN: standard deviation of RR-intervals (Equation (
11)). It reflects the cyclic components responsible for variability in the RR-intervals time series. The SDNN is the “gold standard” for medical stratification of both morbidity and mortality [
16].
where N is the number of beats and RR is the intervals between two consecutive R and
is the mean of RR-intervals in the time series.
- -
PNN50: the ratio between NN50 (i.e., number of pairs of successive RR intervals that differ by more than 50 ms) and the total number of RR-intervals (Equation (
12)).
Frequency domain:
- -
Power spectral density (PSD): describes the distribution of power into frequency components composing that signal. The Lomb–Scargle periodogram for PSD estimation was found to be the most appropriate method to analyze RR-interval data [
5,
6]. VLF (power in very-low-frequency ranges, i.e., ≤0.04 Hz), LF (power in low-frequency ranges, i.e., 0.04–0.15 Hz), HF (Power in high-frequency ranges, i.e., 0.15, 0.4 Hz), LF/HF ratio (ratio between LF and HF expressed as ms
), and total power (Power in all the frequency ranges, i.e., ≤0.4) were obtained by the sum of the power in the relevant frequency range in the spectrum.
Non-linear HRV features:
- -
Poincaré plot: it is a type of recurrence plot used to quantify self-similarity in processes. A Poincaré plot is a graph of RR interval () against the previous one (). From this scatter plot, it is possible to quantitatively analyze the variance of two consecutive RR-intervals by fitting an ellipse to the plotted shape. is the standard deviation of Poincaré plot perpendicular to the line-of-identity, while is the standard deviation of the Poincaré plot along the line-of-identity.
2.4. Success Metrics
We assessed the difference of HRV variables computed on real time series and the ones with missing values by the root mean squared error (RMSE). Additionally, the relative errors (REs, see Equation (
13)) were used to assess the effects of the missing data on the HRV features compared with the parameters calculated from the RR-intervals timeseries without missing data.
where
refers to the HRV features computed from RR-intervals timeseries without missing values, while
refers to the values obtained from interpolated timeseries.
4. Conclusions
In this work we quantify the expected error propagation of missing values in RR-intervals timeseries to HRV features, as a function of preprocessing interpolation approach, and amount of missing data. The main findings of this study is that the interpolation of missing values in RR-intervals timeseries on time (i.e., the heartbeats timestamps) produces more reliable HRV features estimations compared to interpolation on duration.
By using this preprocessing approach, the quantification of the expected error on HRV features caused by a huge amount of missing values (e.g., motion artifacts on a wrist-worn wearable device) can support better estimations of users’ well-being, by assessing their HRV features. This enables continuous passive monitoring of users’ cardiovascular activity in a non-obtrusive way, collecting data during their daily activities that could enable further research on preventative health.
A limitation of this study is the fact that we limited our focus on healthy subjects with normal heart activity, limiting the analysis to large amounts of missing values induced by motion artifacts, ignoring physiological phenomena such as ectopic beats.
Future studies will be useful for researcher and companies, which give insight into heart rate variability recorded by wrist worn IoT wearable devices, in order to better understand the potentiality of the data extracted from these devices to make inference about people heath status. Future work is needed to assess the influence of missing values simulated in accordance with motion and mechanical artifacts induced by external stimuli during data acquisition by using wrist worn IoT wearable devices. Finally, future works will also include the investigating the influence of missing values on HRV features on short timeseries (e.g., 2 min, 1 min and 30 s) and the identification of the shortest time required to obtain accurate estimation of users’ HRV features.