Sleep stage classification with ECG and respiratory effort

Automatic sleep stage classification with cardiorespiratory signals has attracted increasing attention. In contrast to the traditional manual scoring based on polysomnography, these signals can be measured using advanced unobtrusive techniques that are currently available, promising the application for personal and continuous home sleep monitoring. This paper describes a methodology for classifying wake, rapid-eye-movement (REM) sleep, and non-REM (NREM) light and deep sleep on a 30 s epoch basis. A total of 142 features were extracted from electrocardiogram and thoracic respiratory effort measured with respiratory inductance plethysmography. To improve the quality of these features, subject-specific Z-score normalization and spline smoothing were used to reduce between-subject and within-subject variability. A modified sequential forward selection feature selector procedure was applied, yielding 80 features while preventing the introduction of bias in the estimation of cross-validation performance. PSG data from 48 healthy adults were used to validate our methods. Using a linear discriminant classifier and a ten-fold cross-validation, we achieved a Cohen’s kappa coefficient of 0.49 and an accuracy of 69% in the classification of wake, REM, light, and deep sleep. These values increased to kappa = 0.56 and accuracy = 80% when the classification problem was reduced to three classes, wake, REM sleep, and NREM sleep.

Keywords: sleep staging, ECG, respiratory effort, feature selection (Some figures may appear in colour only in the online journal)

Introduction
A problem with traditional sleep monitoring, known as polysomnography (PSG), is that a wide array of potentially sleep-disturbing sensors must be applied to the body and their measurements can only be interpreted by highly trained sleep technicians or scientists. Albeit invaluable in the diagnostic of sleep disorders, traditional PSG is rather ill-suited for regular, non-diagnostic monitoring of sleep and will only introduce more sleep disturbances when applied on a daily basis by untrained individuals. This scenario makes apparent a need for unobtrusive methods of sleep monitoring, preferably inexpensive and with no training required to operate them. Cardiorespiratory monitoring can be unobtrusive and the data can be analyzed by a computer, which makes this technology a promising candidate for personal, continuous and unobtrusive sleep monitoring.
Cardiorespiratory sleep staging or sleep stage classification is often based on heart rate variability (HRV) calculated from electrocardiogram (ECG) and respiratory effort, often from respiratory inductance plethysmography (RIP). Usually cardiorespiratory information is combined with body movements from an accelerometer to more accurately distinguish wake from sleep. One of the earliest studies that presented a successful machine learning approach to cardiorespiratory sleep stage classification with these modalities was done by Redmond and Heneghan (2006). Using a set of HRV features to model the autonomic nervous activity and a set of respiratory features to model the parasympathetic tone, Redmond and colleagues showed the viability of a sleep stage classifier that can generate a simplified hypnogram for an entire night indicating, for each 30 s segment, a sleep stage, classified as either wake, rapid-eye-movement (REM) sleep, or non-REM (NREM) (wake-REM-NREM or WRN classification for short). More recent research has shown that it is possible to obtain the same cardiorespiratory information from other sensors for sleep stage classification, such as from bed-mounted ballistocardiogram Watanabe 2004, Kortelainen et al 2010) or contactless radio frequency (de Chazal et al 2011). Although these studies focused on distinction between wake, REM sleep, and NREM sleep (without separating NREM sleep in other sleep stages) or between wake and sleep (merging REM and NREM sleep), these attempts promised that cardiorespiratory methods could one day be completely unobtrusive.
In previous work (Long et al 2014d we proposed methods to simultaneously classify wake, REM sleep, light sleep (NREM stage S1 and S2), and deep sleep or slow wave sleep (stage S3 and S4) using respiratory activity in order to estimate an overnight wake-REM-light-deep sleep (WRLD) hypnogram. In comparison with WRN classification, achieving WRLD classification would allow a more adequate assessment of sleep since deep sleep is thought by some researchers to be important in several cortical and physiological processes, such as energy conservation (Berger and Phillips 1995), cerebral restoration (Benington and Heller 1995), and memory processing and consolidation (together with sleep spindles which occur during NREM stage 2) (Stickgold 2005, Walker 2009). In that work, we also reviewed the state-of-the-art in sleep stage classification with cardiac and/ or respiratory activity. The methods presented there will be used to benchmark the method proposed in this paper. In addition, two studies with comparable results have been proposed by Domingues et al (2014) and Willemen et al (2014). However, these works only report results on a three-class task (WRN classification) or use non-standard one-minute epochs for classification, respectively. This paper presents a methodology for automatic sleep stage classification based on machine learned models of the autonomic nervous system during sleep from ECG and RIP signals. Compared to previous studies, our methodology includes novel features, new feature post-processing methods, and a refined feature selection method which guarantees that no bias is introduced in the validation of the algorithm while avoiding the use of a hold-out validation set. These methods are applied for three-class (WNR) and four-class (WRLD) sleep stage classification of healthy subjects.

Data sets
The data set was the same as used in earlier work (Long et al 2014d and comprised full single-night polysomnographic (PSG) recordings of 48 subjects (27 females) acquired in the SIESTA project (Klosch et al 2001). All subjects were healthy sleepers with a Pittsburgh Sleep Quality Index (Buysse et al 1989) of less than 6 and had no regular sleep complaints nor earlier diagnosis of sleep disorders. The subjects had an average age of (± ) 41.3 16.1 years at the time of the recording. Full subject demographics can be found in our earlier work (Long et al 2014d). Sleep stages were scored by trained sleep technicians in six classes according to the R and K rules (Rechtschaffen and Kales 1968). In the scope of this study, S1 and S2 were merged in a single L (light sleep) class and S3 and S4 were merged in a single D (deep sleep) class. Each PSG recording comprised, besides the standard signals required for sleep scoring, modified lead II ECG, and (thoracic) respiratory effort recorded with respiratory inductance plethysmography (RIP). QRS complexes were detected and localized from ECG signals using a combination of a Hamilton-Tompkins detector (Hamilton andTompkins 1986, Hamilton 2002) and a post-processing localization algorithm . Prior to feature extraction, RIP signals were filtered with a 10th order Butterworth low-pass filter with a cut-off frequency of 0.6 Hz, after which baseline was removed by subtracting the median peak-to-through amplitude (Long et al 2014d).

Feature extraction
We extracted a set of 142 features from cardiac and respiratory activity, and from cardiorespiratory interaction (CRI) using a sliding window centered on each 30 s epoch, guaranteeing sufficient data to capture the changes in autonomic activity (Malik et al 1996). Since some features are computed based on windows which exceed the epoch length, epochs at the start and end of each recording required a special handling: for each such feature, all epochs for which the window crosses the boundaries of the recording were marked as invalid; the feature values for these epochs were interpolated during post-processing using spline fitting (section 2.2.4).

Cardiac features.
Considering cardiac activity, 86 cardiac features were computed from the QRS complexes detected in the ECG signal. Time domain features, computed over nine epochs, include mean heart rate, mean heartbeat interval (detrended and non-detrended), standard deviation (SD) of heartbeat intervals, difference between maximal and minimal heartbeat intervals, root mean square and SD of successive heartbeat interval differences, and percentage of successive heartbeat intervals differing by >50 ms (Malik et al 1996, Redmond et al 2007. We also computed the mean absolute difference and different percentiles (at 10%, 25%, 50%, 75%, and 90%) of detrended and non-detrended heart rates and heartbeat intervals (Ylmaz et al 2010, Willemen et al 2014) as well as the mean, median, minimal, and maximal likelihood ratios of heart rates (Basner et al 2007). In the frequency domain, the features included the logarithmic spectral powers in the very low frequency band (VLF) from 0.003 to 0.04 Hz, in the low frequency band (LF) from 0.04 to 0.15 Hz, in the high frequency band (HF) between 0.15 to 0.4 Hz, and the LF-to-HF ratio (Busek et al 2005), where the power spectral densities were estimated over nine epochs. The spectral boundaries were adapted to the corresponding peak frequency, yielding their boundary-adapted versions (Long et al 2014c). We also computed the maximum module and phase of HF pole  and the maximal power in the HF band and its associated frequency representing respiratory rate (Redmond et al 2007). Features describing non-linear properties of heartbeat intervals were quantified with detrended fluctuation analysis (DFA) over 11 epochs (Kantelhardt et al 2001) and its short-term (α 1 ), long-term (α 2 ), and all time scaling exponents (Iyengar et al 1996, Penzel et al 2003, progressive DFA with non-overlapping segments of 64 heartbeats (Telser et al 2004), windowed DFA over 11 epochs (Adnane et al 2012), and multi-scale sample entropy (MSE) over 17 epochs (length of 1 and 2 samples with scales of 1-10) (Costa et al 2005). Approximate entropy of the symbolic binary sequence that encodes the increase or decrease in successive heartbeat intervals over nine epochs was also calculated (Cysarz et al 2000). In addition, we propose new features based on a visibility graph (VG) and a difference VG (DVG) method to characterize HRV time series in a two-dimensional complex network where samples are connected as nodes in terms of certain criteria (Lacasa et al 2008, Long et al 2014a. The network-based features, computed over seven epochs, comprised the mean, SD, and slope of node degrees and number of nodes in VG-and DVG-based networks with a small degree (⩽3 for VG and ⩽2 for DVG) and a large degree (⩾10 for VG and ⩾8 for DVG), and assortativity coefficient in the VG-based network (Shao 2010, Long et al 2014a, Zhu et al 2014.

Respiratory features.
Concerning respiratory activity, 44 features were derived from RIP signals. In the time domain, we estimated the variance of the respiratory effort signal, the respiratory frequency and its SD over 150, 210, and 270 s, the mean and SD of breathby-breath correlation, and the SD in breath length (Redmond et al 2007). One of our previous studies (Long et al 2014d) introduced respiratory amplitude features for sleep stage classification, including the standardized mean, standardized median, and sample entropy of respiratory peaks and troughs (indicating inhalation and exhalation breathing depth, respectively), median peak-to-trough difference, median volume and flow rate for complete breath cycle, inhalation, and exhalation, and inhalation-to-exhalation flow rate ratio. These features were adopted in this work. Besides, we also computed the similarity between the peaks and troughs by means of the envelope morphology using a dynamic time warping (DTW) metric (Berndt and Clifford 1994). From the respiratory spectrum, the respiratory frequency and its power, the logarithm of the spectral power in VLF (0.01-0.05 Hz), LF (0.05-0.15 Hz), and HF (0.15-0.5 Hz) bands, and the LF-to-HF ratio were estimated (Redmond and Heneghan 2006). Respiratory regularity was measured by means of sample entropy over seven epochs (Richman andMoorman 2000, Long et al 2014d) and self-(dis)similarity based on DTW and dynamic frequency warping (DFW) (Long et al 2014b) and uniform scaling  were derived. The same network analysis features as for HRV were also computed for breathto-breath intervals.

Cardiorespiratory interaction features.
Numerous studies have shown that the interaction between cardiac and respiratory activity varies across sleep stages (Ichimaru et al 1990, Cysarz et al 2004, Long et al 2014a. The power associated with respiratory-modulated heartbeat intervals was quantified over windows of nine epochs (Ichimaru et al 1990). In addition, we also extracted the VG-and DVG-based features for CRI (Long et al 2014a). These resulted in a total of 12 CRI features in our feature set.

Feature post-processing.
In order to reduce the impact of physiological differences and equipment-related variations from subject to subject, the features of each subject were first Z-score normalized by subtracting their mean and dividing by their SD. Further, it is known that the sleep pattern of healthy adults progresses with several cycles throughout the night (Carskadon and Dement 2011). For example, REM and NREM sleep alternate with 4-6 cycles of about 90-110 min with deep sleep usually dominating the NREM periods during the first half of the night. This suggests that the autonomic physiological response with its associated sleep stage is time-variant across the night for each subject. For this reason, we were motivated to smoothen each feature for each subject by means of a cubic spline fitting method (De Boor 2001). This is also expected to help reduce signal measurement noise and variability within subjects for each sleep stage conveyed by the feature values. The latter can be caused by body movements, conscious breathing control, internal physiological variations, or other external factors such as changes in environmental noise and temperature during bedtime sleep. Instead of other simpler low-pass filters, spline fitting was chosen since it can interpolate feature values which could not be computed, for example due to motion artifacts (about 10% observed in our data set) or at the start and end of each recording for features computed with windows exceeding the epoch duration. This procedure allows all epochs in each recording to be classified.
Let t represent a sequence of feature values v = {v 1 , v 2 , ..., v n } at their corresponding time (or epoch) indices t = {t 1 ,t 2 , ...,t n } (in 30 s), then a relation between them can be modeled by where h is a smoothing (spline) function, ε i are independent and identically distributed residuals. The smoothing function can be estimated by minimizing the objective function (i.e. penalized sum of square) such that where λ is a smoothing parameter that controls the trade-off between residual and local variation. The smoothing function can be expressed by cubic B-splines as basis functions and determined via least squares approximation (Unser 1999, De Boor 2001. For a specific overnight recording with a total of m epochs, it is divided in s continuous segments ( = ⌈ ⌉ s m n / ), designated as smoothing splines. Each segment can then be modeled by the spline function, yielding a general spline fitting for the epochs over the entire recording. n represents the smoothing window size where a larger n translates to a smoother fitting curve. In this work, a window size of nine epochs for modeling splines was experimentally found to be appropriate for the task of sleep stage classification.

Classifier
This work used a multi-class Bayesian linear discriminant with time-varying prior probabilities (Redmond et al 2007), similar to that used in previous work (Long et al 2014d). For each epoch, the selected class (D, L, R, or W) is the class ω i that maximizes the posterior probability given an feature vector x (Duda et al 2000), with the the discriminant function g i for each class given by where μ i is the average feature vector for class i, Σ is the pooled covariance matrix for all classes, and ω ( ) t P , i is the prior probability for class i at time (since lights off) t. All parameters were estimated during training.

Feature selection
To select the final list of features we used a wrapper feature selection method based on sequential forward selection (SFS) (Whitney 1971) using as criterion the Cohen's kappa coefficient of agreement κ (Cohen 1960) on the training set. This measure of agreement between the classification predictions and the ground-truth annotations is more adequate than traditional measures of accuracy for this problem since there is a strong imbalance between classes (L epochs, for example, account for more than 50% of all epochs in the data set) and this coefficient factors out chance agreement, compensating for class imbalance.
In many machine learning studies supervised feature selection is often applied on the entire data set, even if the training and validation are kept separate (for example using cross-validation). This common pitfall is known to introduce a bias in the evaluation of a classifier's performance, which will often be overestimated (Smialowski et al 2010). Although keeping a hold-out set for validation would solve this problem, the limited size of the data set would either mean that the model learning would be based on potentially insufficient examples, or that the classifier would be evaluated on a very small sample, potentially unrepresentative of the problem at hand. Instead, the feature selection procedure was executed by strictly separating, on an iterative procedure akin to cross-validation, the training and validation sets. For each iteration, unbound SFS was applied using as criteria the classification performance obtained in the training set of each iteration. The final number of features was chosen as the smallest number S that yield a certain percentage of the maximum training kappa obtained across all iterations. The final list of selected features was chosen as the S features most often selected during the process.
The discriminative power of selected features was evaluated with the absolute standardized mean distance (ASMD) between the feature values of two classes, computed as where x 1 and x 2 are the sample means for class 1 and 2 and σ is the pooled sample SD.

Validation and evaluation
After feature selection is performed and the set of features is chosen, the classification results per subject were evaluated using a ten-fold cross-validation procedure. The kappa coefficient for all subjects in the data set as well as the average and pooled performance were then calculated. In addition to the kappa coefficient, we also computed the traditional metric accuracy, i.e. the percentage of correctly identified epochs. For kappa and accuracy, the results were computed both after pooling the predictions over all epochs of all subjects and after averaging the performance for each subject.

Results and discussion
3.1. Feature selection Figure 1 indicates the kappa coefficient obtained in the training set of each iteration of the feature selection procedure, for a varying number of features. As illustrated, the maximum average training performance is obtained for 105 features, with an average kappa of 0.58. Also clear in the figure, is a plateau in performance between 70 and 100 features. This suggests that the number of features can be greatly decreased without affecting the training performance. A small feature set is often desirable to prevent over-fitting to the training data, as long as it is not so small that the model cannot learn the characteristics of the problem. By choosing different operating points in figure 1, we can choose a smaller feature set at the expense of a reduction in training performance. By allowing a reduction of 1% in training performance (from the maximum kappa of 0.58-0.57), we can reduce the number of features from 150 to 80 features (a reduction of 23.8%). Below this number there is a statistically significant (after a Wilcoxon signed-rank test, with p < 0.05) decrease in training performance and further decreasing the number of features will likely lead to a decrease in classification performance after cross-validation. Using as criteria the smallest number of features which does not cause a statistically significant decrease below the maximum training performance, we chose a total of S = 80 features.
After ranking all features by the number of times they were selected during the iterative feature selection procedure and selecting the 80 features with the highest count, we found that all features in the final feature set were selected in at least 5 of the 10 iterations (with a mean count of 7.67) with 14 features having been selected in all 10 iterations. This illustrates the robustness of the modified SFS method: despite their simplicity and computational efficiency, sequential selection algorithms are known to suffer from a so-called 'nesting effect', potentially leading to sub-optimal feature sets (Pudil et al 1994). By iteratively performing several unbound SFS searches on different training sets and keeping only the features that are selected most often, this effect was reduced, as attested by the large number of iterations each feature in the final set was selected. For brevity only the 14 features selected in all iterations will be discussed further. Table 1 indicates the discriminative power of each feature using the pooled ASMD. It was computed for each pair of classes after aggregating the feature values for all subjects and also the 90th percentile of the ASMD (in parenthesis) obtained for each feature, for all individual subjects. Pooled ASMD values below 0.5 were omitted and 90th percentile ASMD values below 1 were omitted.
The top features are clearly discriminative for different pairs of classes which helps explain the relatively large number of features selected. Additionally, it is interesting to observe that there is one feature (median likelihood ratio) which does not have a pooled ASMD above 0.5 for any class pair. However, its 90th percentile ASMD value is larger than 1 for the pairs D/W and L/W. This is a good example of a feature which is discriminative for only a subset of the subjects (at least 10%) but not for all subjects. The fact that it was selected in every single iteration using the wrapper method described in section 2.4 suggests that it is complementary to other features for some subjects, helping raise the overall training performance.
A note should be made regarding long-term cardiac features such as α 2 or larger-scale MSE features. None of these features were part of the final set of features selected with our method. One possible explanation for this is related to the length of the time series used to compute them. The choice of window sizes (for MSE, 17 epochs, i.e. 8.5 min) represents a compromise between having as much data as possible to accurately calculate the features, while at the same time not exceeding the average length of a given sleep stage (in our data set, the average length of deep and REM sleep periods was found to be 5.1 and 8.7 min, respectively). Although theoretically the window sizes we used are sufficient to calculate these features (a window of 17 epochs corresponds, at an average heart rate of 60 bpm, to 510 samples), it has been suggested that the estimation of sample entropy is low in series shorter than 10 m (where m is the pattern length, in samples) (Richman andMoorman 2000, Yentes et al 2013). This means that for a pattern length of m = 2 and scales higher than 5, the coarse-grained time series used to calculate the sample entropy will have less data points than the suggested limit and the features might not be accurate, and therefore, not representative of the autonomic characteristics of different sleep stages. Table 2 indicates the overall classification performance obtained after 10-fold cross-validation using the selected set of 80 features. In addition, it indicates the performance per class, obtained by considering each class as the positive class and merging the remaining in a single negative class. The highest performance is obtained for R detection, followed by W. The lowest performance is obtained for L. This is further confirmed by the confusion matrix of table 3 which shows that the largest proportion of errors occurs when trying to distinguish L from the other classes. For all other classes, the percentage of misclassified epochs (relative to the total number of epochs) is below 1% except for L.

Cross-validation
In order to evaluate the performance of the classifier in a three-class task (WRN), classes D and L were merged in a single N (non-REM) class. Table 2 indicates the resulting performance. Analyzing the performance of the classifier we see that the classification performance rises substantially, to a kappa of 0.56 and an accuracy of 80%. This was expected since a large number of classification errors occurred between D and L, and in a WNR task these two classes no longer need to be distinguished. Figure 2 illustrates examples of predicted hypnograms, as compared with the reference, for three subjects in the data set: the subject with the worst performance (with κ = 0.17), with the median performance (with κ = 0.50) and with the best performance (with κ = 0.69). A possible Note: the features are described in section 2.2. The pooled ASMD was computed for each pair of classes after aggregating the feature values for all subjects (values below 0.5 were omitted); The 90th ASMD percentiles (in parentheses) were obtained after computing the ASMD of each feature, for each subject (values below 1 were omitted) Note: the pooled performance was computed after aggregating all epochs of all subjects. The mean and SD were calculated based on the performance for each individual subject explanation for the poor performance obtained for the worst subject is that the model trained with the characteristics of the general sample population does not fully capture this subject's cardiac and respiratory expression of different sleep stages. However, despite the low kappa coefficient, the predicted hypnogram still exhibits some correct features, namely, most REM intervals were detected, albeit with the incorrect length, and the two deep sleep periods were also detected. As the performance improves, we see that the predicted hypnograms match better the characteristics of the reference hypnogram, and in the best case the most obvious mistakes are in the missed detection of brief periods of wake during the night while the rest of the sleep stages are correctly predicted. This is likely caused by the use of spline smoothing during feature post-processing, which is adequate to capture the slow-changing characteristics of most sleep stages, but penalizes short, abrupt changes such as brief periods of awakening. Table 4 compares the results of our work with other studies reported in literature. As indicated, only a few studies focused on WRLD classification based on cardiac and/or respiratory signals and our results are amongst the best performing. The first observation is that the results of our previous work , which used only respiratory features, are worse than those produced in the present work, indicating that combining cardiac and respiratory activity can lead to an improved classification performance. The study of Hedner et al (2011) achieved similar results but they used more signal modalities including peripheral arterial tone, actigraphy, and pulse oximetry. The recent study by Willemen et al (2014) also achieved a good performance, although it was validated with a younger sample population, excluded 12% of the epochs from validation and used a basis of 60 s epochs instead of the standard scoring basis of 30 s which makes the results incomparable.

Comparison with state-of-the-art
For WRN classification we see that, to the best of our knowledge, our results also outperform those reported in almost all of the previous studies. In comparison with one of the best performing studies (Domingues et al 2014), we obtain a higher accuracy (albeit a slightly smaller kappa) but require one less modality (actigraphy). Regarding the work of Willemen et al (2014) it is again important to note that the results in that study were obtained on basis of 60 s epochs.
These results suggest that our choice for a Bayesian linear discriminant was appropriate for this task. Besides its simplicity, it offers the benefit of a probabilistic framework which allows, for instance, the direct use of time-varying prior probabilities to improve classification. In comparison with increasingly popular black-box approaches, this classifier has the additional advantage that it does not require the tuning of critical parameters such as kernels for support vector machines, number of nodes in classification trees or number of hidden layers in neural networks.

Conclusions
This paper presents a method to identify overnight sleep stages using cardiorespiratory features extracted from ECG and RIP signals. These features were post-processed by means of subject-specific Z-score normalization and spline smoothing, which helps reduce the influence of signal noise, between-subject, or within-subject variability in autonomic physiology. Eighty features were selected from a set of 142 features using a modified SFS-based feature selector designed to avoid biasing the validation performance. Using a linear discriminant classifier in a ten-fold cross-validation procedure, the classification results (for both the fourclass WRLD and three-class WRN classification tasks) achieved in this work (table 4) outperform most of the previous studies.
As future work it would be interesting to investigate whether this methodology would achieve a comparable performance in subjects with sleep disorders such as sleep apnea or insomnia. If successful, and given its potential for unobtrusive, prolonged use at home, it could represent a significant step towards lowering the costs and complexity of home tests and thus complement the current medical practices for diagnosis of sleep disorders.