3.3. Evaluating staging/signal quality (SOAP)

As described here, Luna's SOAP command can quickly evaluate whether there are gross misalignments or other inconsistencies between staging and signal (typically, but not necessarily, EEG) data.

In brief, SOAP tries to predict existing staging given one or more (EEG) signals. If the prediction within a single individual is poor (a low kappa), this suggests that the signal(s) and/or staging annotations are problematic (assuming, that is, that the signals are normally expected to covary with true sleep stage).

Internally, SOAP performs a spectral analysis (Welch PSD) per epoch, then applies dimension reduction (PCA) over the epochs. It selects the components that are associated with stage labels. It then fits a linear discriminant analysis (LDA) model to predict stage given the PCs (just for that one individual), and compares the fitted to the observed values to obtain measures of accuracy (e.g. kappas, etc).
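To make the logic concrete, here is a minimal conceptual sketch of these steps in R - an illustration of the idea only, not Luna's actual implementation. It assumes we already have an epoch-by-feature matrix X (e.g. per-epoch Welch power values) and a vector stage of observed labels, and uses MASS::lda() for the discriminant step:

library(MASS)   # for lda()

## conceptual SOAP sketch: PCA over epochs, keep stage-associated
## components, then predict stage from those components via LDA
soap_sketch <- function( X , stage , p.thresh = 0.01 ) {
  pc <- prcomp( scale(X) )$x
  ## retain components nominally associated with stage (one-way ANOVA)
  pv <- apply( pc , 2 , function(z) summary( aov( z ~ stage ) )[[1]][[ "Pr(>F)" ]][1] )
  pc <- pc[ , pv < p.thresh , drop = FALSE ]
  if ( ncol(pc) == 0 ) stop( "no stage-associated components" )
  ## fit LDA for this one individual, then compare fitted to observed
  fit  <- lda( pc , grouping = stage )
  pred <- predict( fit )$class
  table( Predicted = pred , Observed = stage )
}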

Note that this is a qualitatively different exercise from automated staging (which we address in the next section). Here we are using a "prediction-like" approach only to assess the general association between existing staging and signals. As such, the kappa should not be interpreted in the same manner as is typically done for automated staging (true prediction).

Evaluating stage consistency

Here we'll run SOAP for a single channel (C3):

luna harm1.lst -o out.db -s ' SOAP sig=C3 '

Restrictions using SOAP

In a previous step, we replaced one v2 recording (F01) that only had N2 epochs with the correct, original version: if we hadn't done this, we'd have to skip F01 here, as SOAP would fail; its model relies on intra-individual variation across different stages.

Also note that we've resolved issues with misaligned staging already: if we hadn't done this, we'd need to add EPOCH align before running SOAP.

We now extract the per-channel output for a subset of variables (the -p 2 option restricts output to two decimal places):

destrat out.db +SOAP -r CH -v NSS K K3 ACC ACC3 -p 2 
ID   CH   ACC  ACC3     K    K3   NSS
F01  C3  0.86  0.92  0.80  0.84  5.00
F02  C3  0.64  0.76  0.42  0.44  5.00
F03  C3  0.46  0.68  0.26  0.40  5.00
F04  C3  0.50  0.71  0.16  0.17  5.00
F05  C3  0.85  0.92  0.79  0.83  5.00
F06  C3  0.85  0.93  0.77  0.82  5.00
F07  C3  0.88  0.92  0.82  0.87  5.00
F08  C3  0.78  0.88  0.70  0.76  5.00
F09  C3  0.90  0.95  0.83  0.87  5.00
F10  C3  0.82  0.89  0.75  0.79  5.00
M01  C3  0.89  0.93  0.78  0.72  5.00
M02  C3  0.91  0.96  0.82  0.90  5.00
M03  C3  0.82  0.89  0.74  0.79  5.00
M04  C3  0.85  0.93  0.80  0.86  5.00
M05  C3    NA    NA    NA    NA -1.00
M06  C3  0.88  0.93  0.82  0.87  5.00
M07  C3  0.87  0.92  0.81  0.86  5.00
M08  C3  0.88  0.95  0.83  0.91  5.00
M09  C3  0.82  0.91  0.74  0.75  5.00
M10  C3  0.81  0.94  0.73  0.84  5.00

The K variables are the kappa statistics for SOAP; the ACC variables are the corresponding simple accuracy metrics. The variables K3 and ACC3 are the corresponding values for a three-class (NR vs R vs W) classification, instead of the default 5-class one. NSS is the number of unique stages observed, i.e. those with at least a minimum required number of observations/epochs to allow SOAP to construct a predictive model (10 by default, set by req-epochs).
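For reference, kappa is chance-corrected agreement: it discounts the accuracy that would be expected if predictions simply followed the marginal stage frequencies. A minimal R helper to compute it from two label vectors (a generic illustration, not part of lunaR):

## Cohen's kappa: agreement beyond chance, given the marginals
cohen_kappa <- function( obs , pred ) {
  levs <- union( obs , pred )
  tab  <- table( factor( obs , levs ) , factor( pred , levs ) )
  p0   <- sum( diag(tab) ) / sum(tab)                       # raw accuracy
  pe   <- sum( rowSums(tab) * colSums(tab) ) / sum(tab)^2   # chance agreement
  ( p0 - pe ) / ( 1 - pe )
}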

We see that M05 has a value of -1 for NSS and no values for the kappa/accuracy statistics. Here, -1 is a special code meaning that no components sufficiently associated with observed stage were retained, a prerequisite for building the LDA model. If we look at the console output, we'll see the error thrown:

 CMD #1: SOAP
   options: sig=C3
  using LDA for primary predictions
  ------------------------------------------------------------
  fitting SOAP model for single channel C3
  read 1 feature specifications (99 total features on 1 channels) from _1
  set epochs to default 30 seconds, 933 epochs
  set 0 leading/trailing sleep epochs to '?' (given end-wake=120 and end-sleep=5)
  applying Welch with 4s segments (2s overlap), using median over segments
  expecting 99 features (for 933 epochs) and 1 channels
  standardizing X
  final feature matrix X: 99 features over 933 epochs
  of 933 total epochs, valid staging for 933, and of those 933 passed outlier removal
  outlier counts: flat, Hjorth, components, trimmed -> total : 0, 0, 0, 0 -> 0
  re-standardizing X after removing bad epochs
  epoch counts: N1:84 N2:593 N3:83 R:75 W:98
  final epoch counts: N1:84 N2:593 N3:83 R:75 W:98
  final count of valid epochs is 933
  0 p<0.01 stage-associated components, bailing
  *** fewer than 2 non-missing stages for this individual, cannot complete SOAP

By default, SOAP only uses components that are nominally (p < 0.01) associated with staging when running the LDA. With very unusual recordings, it can sometimes happen that no components meet this criterion (which is itself a warning of potentially poor staging). If you see an NSS of -1 and no emitted kappa statistics, you can assume there is a catastrophic problem with signal/staging consistency (i.e. extremely low kappas). We can force SOAP to build a model by altering the p < 0.01 threshold; e.g. setting it to 1, or some other very high p-value, means that most/all components will be included:

luna harm1.lst id=M05  -o out.db -s ' SOAP sig=C3 pc=0.9 '
  retaining 9 of 10 PSCs, based on ANOVA p<0.9

  Confusion matrix: 5-level classification: kappa = 0.00, acc = 0.64, MCC = 1.00

           Obs:   N1      N2     N3      R      W      Tot
  Pred: N1         0       0      0      0      0     0.00
        N2        84     593     83     75     98     1.00
        N3         0       0      0      0      0     0.00
        R          0       0      0      0      0     0.00
        W          0       0      0      0      0     0.00
        Tot:    0.09    0.64   0.09   0.08   0.11     1.00


  Confusion matrix: 3-level classification: kappa = 0.00, acc = 0.81, MCC = 1.00

           Obs:  NR     R     W     Tot
  Pred: NR      760    75    98     1.00
        R         0     0     0     0.00
        W         0     0     0     0.00
        Tot:   0.81  0.08  0.11     1.00

Now we see that kappa/accuracy statistics have been calculated, but the kappas are zero as all epochs have the same prediction (N2). In other words, this is telling us that SOAP was completely unable to model the signal/staging relationship, and so will always output the most common stage label (N2) as the best guess. (This is also a good reminder why we look at kappa statistics and not accuracy values, i.e. accuracy is still 0.81 in the 3-class case.)
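We can verify this arithmetic with the cohen_kappa() helper sketched above, using the epoch counts from M05's console output:

## a constant all-N2 prediction: accuracy equals the N2 base rate, kappa is 0
obs  <- rep( c("N1","N2","N3","R","W") , times = c(84, 593, 83, 75, 98) )
pred <- rep( "N2" , length(obs) )
mean( obs == pred )        # 0.64 : the 5-class accuracy
cohen_kappa( obs , pred )  # 0    : no agreement beyond chance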

Plotting the K kappa statistic (the fifth column above, setting the NA value to 0 for M05), we can see four individuals who appear to be clear outliers - the points colored red: F02, F03, F04 and M05:

[img: per-individual SOAP kappa (K) for C3, with F02, F03, F04 and M05 highlighted in red]
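The plotting code is not shown above, but a minimal version might look like the following (assuming the per-channel destrat output was redirected to a file, e.g. destrat out.db +SOAP -r CH -v NSS K > o.1):

d <- read.table( "o.1" , header=T , stringsAsFactors=F )
d$K[ is.na( d$K ) ] <- 0                   # set M05's NA kappa to 0
bad <- d$K < 0.5                           # a rough outlier threshold
plot( d$K , pch=20 , col = ifelse( bad , "red" , "black" ) ,
      xlab = "Individual" , ylab = "SOAP kappa (K)" , ylim = c(0,1) )
abline( h = 0.5 , lty = 2 )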

This suggests that four recordings likely have some issue with their staging/signal consistency, which we'll investigate below.

Indeed, if we consult the details of manipulations, we'll see that all four individuals had some major type of issue with their staging:

  • F02: staging epochs were shifted by 22 epochs (>10 minutes) relative to the true signal times (e.g. which can be easily caused by the EDF start time being misspecified, etc)

  • F03 and F04 had their annotation files literally swapped with one another

  • M05 had all staging information scrambled completely at random

As such, it is interesting to note that SOAP saw M05 as the "most severe" class of problem, with a kappa of 0 (and with it initially refusing even to build a model, returning NSS of -1). This is because the swapped/shifted stagings still broadly approximate a "typical" hypnogram, and so will tend to show a non-trivial level of correspondence to another (typical) sleeper's.

Note that SOAP won't pick up every possible type of problem with staging: for example, M02 has severely truncated staging (about half the night missing), but this class of error is effectively invisible to SOAP. (Of course, this particular issue would be trivial to detect by other means, e.g. simply viewing hypnograms or counting the number of missing/lights-on epochs).

Interpreting SOAP kappa statistics

For SOAP "low" kappa statistics mean roughly below 0.5 or so. This indicates a likely only-modest relationship between the properties of signal (EEG) and staging. As we expect sleep EEG to track with staging (after all, it is the primary basis for determining stage, whether by eye or automatically...), this could be due to a) a bad/noisy/misaligned signal, b) bad/misaligned staging, or c) both. The threshold of 0.5 should not be considered to be fixed in stone, however. For example, individuals with short or unusual sleep periods may yield relatively low kappa values, as might cases where only nonstandard signals are available (e.g. a bipolar montage that doesn't captured stage-dependent activity well).

Epoch-level review

If we add the epoch flag to SOAP, it outputs epoch-level information: the observed stage label (PRIOR) plus the posterior probabilities (PP_N1, PP_N2, etc) and predicted stage (PRED). This is turned off by default as the outputs can get large for very large datasets. We'll also add pc=0.9 so that SOAP emits values for M05:

luna harm1.lst -o out.db -s ' SOAP sig=C3 epoch pc=0.9 '

Now we see an additional E-by-CH stratum in the output:

destrat out.db 
--------------------------------------------------------------------------
out.db: 1 command(s), 20 individual(s), 32 variable(s), 187291 values
--------------------------------------------------------------------------
  command #1:   c1  Sat Sep  7 09:45:46 2024    SOAP    epoch sig=C3
--------------------------------------------------------------------------
distinct strata group(s):
  commands  : factors         : levels      : variables 
------------:-----------------:-------------:---------------------------
  [SOAP]    : CH              : 1 level(s)  : ACC ACC3 F1 F13 F1_WGT K K3 MCC
            :                 :             : MCC3 NSS PREC PREC3 PREC_WGT RECALL
            :                 :             : RECALL3 RECALL_WGT
            :                 :             : 
  [SOAP]    : E CH            : (...)       : DISC DISC3 INC PP_N1 PP_N2 PP_N3
            :                 :             : PP_NR PP_R PP_W PRED PRIOR
            :                 :             : 
  [SOAP]    : CH SS           : 6 level(s)  : DUR_OBS DUR_PRD F1 PREC RECALL
            :                 :             :
            :                 :             : 
  [SOAP]    : CH VAR          : 10 level(s) : INC PV
            :                 :             : 
  [SOAP]    : CH SS ETYPE     : 36 level(s) : ACC N
            :                 :             : 
  [SOAP]    : CH NSS PRED OBS : 34 level(s) : N P
            :                 :             : 
------------:-----------------:-------------:---------------------------

Extracting the epoch-level outputs, we can visualize them in R:

destrat out.db +SOAP -r E CH > o.1

In R, we'll use a convenience plotting function (lpp2()) from lunaR:

library(luna)
d <- read.table("o.1",header=T,stringsAsFactors=F)

First, let's look at a "clean" record with a high kappa, e.g. M06:

lpp2( d[ d$ID == "M06" , ]  )

[img: epoch-level SOAP output for M06: observed stages (bottom), predicted stages (middle) and posterior probabilities (top)]

In the above plot, from bottom to top:

  • the lower row encodes the observed stages (PRIOR) using the usual Luna stage colors (green/blues/red = W/N1,N2,N3/REM)

  • the middle row encodes the predicted stage (PRED)

  • the top row shows the posterior probabilities (i.e. PRED is simply the stage with the highest posterior probability)

In the above plot, there is a clear correspondence between the observed stages and what the simple C3-based model is able to reconstruct, which is reflected in the high kappa statistics for this individual. Most (N=16) of the plots look something like the above, i.e. with probabilities where a) the maximum is typically near 1.0 for each epoch (high model confidence) and b) the predicted stages tend to broadly align with the observed data, as expected.

In contrast, let's look at the four cases where we had very low kappas:

First, here is M05, for which the stage labels were completely scrambled in our manipulation:

lpp2( d[ d$ID == "M05" , ]  )

[img: epoch-level SOAP output for M05: every epoch predicted N2]

Here we see how every epoch is predicted to be N2: the model basically has no information.

Next, we'll look at F03 and F04 - these were both "valid" hypnograms (versus being completely scrambled), but had been assigned to the wrong individuals:

lpp2( d[ d$ID == "F03" , ]  )

[img: epoch-level SOAP output for F03]

We see a generally "messy" posterior distribution - lots of low-confidence predictions and a lack of alignment. We see a similar picture for F04:

lpp2( d[ d$ID == "F04" , ]  )

[img: epoch-level SOAP output for F04]

Finally, let's look at a more subtle case: for F02, we simply shifted the stage labels by about 10 minutes:

lpp2( d[ d$ID == "F02" , ]  )

[img: epoch-level SOAP output for F02]

Because of the shifting, you can see how the first portion of the lower plot is gray, meaning it is unstaged (and thus not analyzed by SOAP). We see a general lack of alignment -- although, simply by eye it does appear (in this particular case) that the predictions are shifted in time, meaning enough valid epochs must have overlapped to build some kind of valid model. We can leverage this fact to realign signals and staging, as we'll do below.

Multi-channel SOAP

Looking at the patterns of SOAP kappas across multiple channels can be informative in the hd-EEG case:

  • if all channels have low kappas, then the problem is likely with the staging

  • if at least a few channels have high/normal kappas, then the problem is more likely to be with the other low-kappa channels

Thus SOAP can be a useful way to spot bad channels even if the staging information is good on the whole.

Here we'll re-run SOAP on all channels. Note, it may take five minutes or so to evaluate all 1,140 (20 * 57) channels -- you can skip this step without impacting the rest of the walkthrough and just look at the results below, if you wish. We'll omit the epoch-level output.

luna harm1.lst -o out.db -s ' SOAP sig=${eeg} pc=0.9 '

destrat out.db +SOAP -r CH -v NSS K K3 > o.1

In R:

d <- read.table("o.1",header=T,stringsAsFactors=F)

First, we'll check that SOAP was able to finish for all individuals (which should be the case with pc=0.9):

table( d$NSS )

   5 
1139 

We can also check that each individual has outputs for the expected (N=57) number of channels (note that one channel is missing for M04, consistent with the 1,139 total above):

table( d$ID ) 

F01 F02 F03 F04 F05 F06 F07 F08 F09 F10 M01 M02 M03 M04 M05 M06 M07 M08 M09 M10 
 57  57  57  57  57  57  57  57  57  57  57  57  57  56  57  57  57  57  57  57 
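Before plotting, we can also apply the decision logic from the bullet points above directly, e.g. by comparing each channel's kappa to that individual's median kappa. (A rough heuristic sketch; the 0.5 and 0.3 thresholds are arbitrary illustrations, not Luna defaults.)

## per-individual median kappa across channels
med <- tapply( d$K , d$ID , median , na.rm=T )

## individuals where even the typical channel is poor -> suspect the staging
names(med)[ med < 0.5 ]

## channels falling well below their individual's median -> suspect the channel
d[ d$K < med[ d$ID ] - 0.3 , c("ID","CH","K") ]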

Here we plot the 5-class kappa for all channels/individuals (colors indicate the individual):

plot( d$K , col = as.factor( d$ID ) , pch=20 ,
      xlab = "Indiv/channel" , ylab = "SOAP kappa" , ylim=c(0,1) )
abline(h=0.5) 

[img: 5-class SOAP kappa (K) for all channels and individuals, with a reference line at 0.5]

So, for the four "bad" individuals, we see that all channels have poor kappa scores: their clusters of points all fall below (or near) the low kappa threshold of 0.5. From this, we'd conclude that it wasn't simply that C3 was a very bad channel for some people - rather, the problem likely lies with the staging (as we know to be the case here).

In contrast, this plot does point to a handful of individual channels that can be flagged as bad - e.g. for M08 (the red cluster of points third from the right in the plot above), one channel is a clear outlier for that individual, falling well below the kappa threshold.

We can identify this channel:

d[ d$ID == "M08" , c("CH","K") ]
      CH         K
969  Fp1 0.8002591
970  Fp2 0.7926766
971  AF3 0.7896391
972  AF4 0.8096110
973   F7 0.8162281
974   F5 0.7973992
975   F3 0.8133728
976   F1 0.8092905
977   F2 0.8110485
978   F4 0.8102376
979   F6 0.8099534
980   F8 0.7954398
981  FT7 0.8198113
982  FC5 0.8235413
983  FC3 0.8178465
984  FC1 0.8159907
985  FC2 0.8191978
986  FC4 0.8230782
987  FC6 0.8175959
988  FT8 0.7998662
989   T7 0.8237364
990   C5 0.8279041
991   C3 0.8369470
992   C1 0.8010708
993   C2 0.8217165
994   C4 0.8177090
995   C6 0.8115075
996   T8 0.8125639
997  TP7 0.8222410
998  CP5 0.8222884
999  CP3 0.8261689
1000 CP1 0.8371216
1001 CP2 0.8141448
1002 CP4 0.8166989
1003 CP6 0.8057053
1004 TP8 0.7921345
1005  P7 0.8207816
1006  P5 0.8199782
1007  P3 0.8212534
1008  P1 0.8001374
1009  P2 0.8050327
1010  P4 0.8108632
1011  P6 0.8129150
1012  P8 0.8081847
1013 PO3 0.8077789
1014 PO4 0.8021840
1015  O1 0.8041398
1016  O2 0.7922199
1017 AFZ 0.7957461
1018  FZ 0.8047130
1019 FCZ 0.8188830
1020  CZ 0.8229660
1021 CPZ 0.8102523
1022  PZ 0.8140062
1023 POz 0.3263965   <--- outlier
1024  OZ 0.7699115
1025 FPZ 0.7938324
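Rather than scanning the full list by eye, we could have flagged this channel directly, e.g. relative to M08's median kappa (the 0.3 margin is again just an illustrative choice):

dd <- d[ d$ID == "M08" , ]
dd$CH[ dd$K < median( dd$K ) - 0.3 ]    # -> "POz"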

If you saved the file tmp/hjorth.1 from the earlier signal review step, we can review it again here:

library(luna)
d <- read.table( "tmp/hjorth.1" , header=T, stringsAsFactors=F)
d$H1 <- log( d$H1 ) 

Just looking at the plots for H1 and H2, we can see there is indeed something off with POz (the midline channel second to the bottom) for M08:

dd <- d[ d$ID == "M08" , ]           
par(mfcol=c(3,1), mar=c(0.5,0.5,0.5,0.5))
ltopo.xy( dd$CH , dd$E , dd$H1 , z = dd$H1 , pch=20 , col = lturbo( 100 ) , cex=0.2 , xlab="Epoch", ylab="log(H1)" )
ltopo.xy( dd$CH , dd$E , dd$H2 , z = dd$H2 , pch=20 , col = lturbo( 100 ) , cex=0.2 , xlab="Epoch", ylab="H2" )
ltopo.xy( dd$CH , dd$E , dd$H3 , z = dd$H3 , pch=20 , col = lturbo( 100 ) , cex=0.2 , xlab="Epoch", ylab="H3" )

[img: ltopo.xy plots of log(H1), H2 and H3 by epoch across channels for M08]

This type of issue will most likely be captured by standard channel/epoch QC procedures (and likely corrected with interpolation), so SOAP is not necessarily the optimal way to catch such issues. Nonetheless, the convergence is reassuring.

Aligning misaligned staging

There is nothing we can do for the scrambled case, M05: as is evident by simply looking at the hypnogram, it needs replacing (or using automated staging, etc). But can we do anything to potentially salvage the manual staging for the other three recordings?

First we'll consider F02. If we have a low SOAP kappa but the predictions look "shifted" - or, more likely, if we know that the timing information has been lost on the staging (e.g. it has nonsensical time-stamps starting in the mid-morning, when the EDF starts in the evening, etc), then we can turn to Luna's PLACE as described in this vignette.

PLACE requires a simple .eannot-style text file of the stage labels: these stages will be aligned to the epochs in the EDF. We can use the STAGE command (which is just a minimal version of HYPNO epoch) to output these stage labels:

luna harm1.lst id=F02 -s STAGE min > stg.txt

We can now run PLACE, passing stg.txt to the required stages option and telling it to write any new (realigned) solution to new.eannot:

luna harm1.lst id=F02 -o out.db -s PLACE sig=C3 stages=stg.txt out=new.eannot

As described in the vignette, PLACE shifts the annotations, one epoch at a time, to find the optimal alignment (highest kappa value). In the console log, it outputs:

  read 861 epochs from stg.txt
  based on EDF, there are 861 30-s epochs
  requiring 0.1 proportion of EDF, and 0.5 of supplied stages overlap

  ...

  optimal epoch offset = -22 epochs (kappa = 0.774458)
  which spans 839 epochs (of 861 in the EDF, and of 861 in the input stages)
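Conceptually, the search PLACE performs resembles the following sketch (an illustration of the idea, not Luna's implementation): slide the supplied stage labels epoch-by-epoch against some per-epoch reference (for PLACE, SOAP-style predictions derived from the signal), score each offset with kappa, and keep the best. This reuses the hypothetical cohen_kappa() helper from earlier:

place_sketch <- function( stages , reference , offsets = -500:500 ) {
  n <- length( stages )
  k <- sapply( offsets , function(o) {
    src <- seq_len(n) - o                 # shift the stage labels by o epochs
    ok  <- src >= 1 & src <= n
    shifted <- rep( NA , n )
    shifted[ ok ] <- stages[ src[ ok ] ]
    use <- ! is.na( shifted )
    if ( sum( use ) < 10 ) return( NA )   # require a minimal overlap
    cohen_kappa( shifted[ use ] , reference[ use ] )
  } )
  offsets[ which.max( k ) ]               # offset with the highest kappa
}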

We can extract the fit for all possible epochs:

destrat out.db +PLACE -r OFFSET > o.1

In R:

d <- read.table("o.1",header=T,stringsAsFactor=F)
plot( d$OFFSET , d$K , type="n" , xlab="Offset (epochs)" , ylab="Kappa" , xlim=c(-500,500) )
abline(v=c(0,-22), col=c("gray","red") )
lines( d$OFFSET , d$K , type="l" ,  col="darkgreen" ,lwd=2)
img

All other things being equal, when the output of PLACE looks like this (i.e. a sharp peak at one high-kappa value), you can be confident this is a likely correct solution (especially if you had reason to believe the stage timings were misplaced, as can sometimes happen when working with archival data, etc). Zooming into the actual values plotted around the peak at an offset of -22 epochs, we can see how dramatically it drops off - i.e. PLACE can resolve the alignment to a single epoch (although this depends on the length and nature of the recording):

destrat out.db +PLACE -r OFFSET/-25,-24,-22,-21,-20 -v K K3 -p 2
ID   OFFSET       K      K3
F02     -25    0.68    0.65
F02     -24    0.71    0.70
F02     -22    0.77    0.78
F02     -21    0.71    0.73
F02     -20    0.70    0.69

The file new.eannot contains the new annotations; note that the last 22 rows are set to ? as the stages have been shifted "backwards" in time by 22 epochs. So, we've "solved" the issue with F02.

Matching swapped annotations

What about the two remaining individuals, F03 and F04? If we suspect that files have been swapped, we can in principle use SOAP to figure this out.

Use of SOAP

This is definitely an edge-case usage of SOAP that might not be practical in large samples. At a certain point, it would probably be cleaner to use automated staging to restage files, or even to redo the manual staging. We nonetheless include this as an illustration of how SOAP works. There may also be cases where it is not easy to perform automated staging (e.g. unusual EEG montages, or even the lack of any EEG signals). In principle SOAP is agnostic to channel type and could work with other signals that covary with sleep stage, e.g. the EOG, EMG or even ECG. It is beyond the scope of this walkthrough to address these use cases, however.

Here we iterate through each of the 20 recordings but tell Luna to: a) ignore that individual's real annotations (with skip-sl-annots=T meaning to skip annotation files defined in the sample list) and b) to instead read annotations from F03 and apply them to each EDF.

As our .annot files are encoded in elapsed seconds, not clock times, any file swaps might not have been self-evident; it also lets us swap them in using this simple approach (although in this particular case, all recordings were forced to have the same 10pm start, and so this point is moot).

We can run SOAP for all EDFs, asking which of them, if not F03 itself, best matches the staging in the file F03.annot:

luna harm1.lst skip-sl-annots=T annot-file=work/harm1/F03.annot \
  -o out1.db -s SOAP sig=C3

Extracting and plotting the results (plotting code not shown):

destrat out1.db +SOAP -r CH -v K3 K

[img: SOAP kappa per individual, pairing F03.annot with each EDF; F03 in dark red, F04 in bright red]
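For completeness, the omitted plotting code might look something like this sketch (assuming the destrat output above was redirected to a file, e.g. > o.2; the highlighting of F03/F04 is illustrative):

d <- read.table( "o.2" , header=T , stringsAsFactors=F )
cols <- rep( "gray" , nrow(d) )
cols[ d$ID == "F03" ] <- "darkred"   # the nominal owner of F03.annot
cols[ d$ID == "F04" ] <- "red"       # the actual best match
barplot( d$K , names.arg = d$ID , col = cols , las = 2 ,
         ylab = "SOAP kappa (K)" , ylim = c(0,1) )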

Although the plot colors F03 in dark red, the only high kappa in fact belongs to F04 (bright red), suggesting that the annotations in F03.annot better match the signals in F04.edf.

Is the converse true? Given that F04 also had low original kappas, perhaps the staging in F04.annot better matches the signals in F03.edf? Again, we'll run this, lining up F04.annot against all N=20 EDFs:

luna harm1.lst skip-sl-annots=T annot-file=work/harm1/F04.annot \
  -o out2.db -s SOAP sig=C3

[img: SOAP kappa per individual, pairing F04.annot with each EDF; F04 in dark red, F03 in bright red]

As in the first instance, although the nominal "owner" of these annotations (F04) is dark red, the only EDF with a high kappa score is F03 (bright red). So, in this particular case, it appears we've detected a straight swap between two individuals.

This may be a slightly contrived example, but our experience working with many datasets suggests that these and similar types of errors certainly can and do occur.

Summary

In this step, we used SOAP to check staging data:

  • we assessed the general quality of staging for most individuals (which was good)

  • we identified a handful of individuals with unusually low SOAP metrics

  • we used SOAP and its friend PLACE to resolve some of these edge cases (involving swapped or shifted staging)

  • we identified a handful of channels likely containing strong artifacts, leading to low SOAP values just for those channels

In this walkthrough we have manual staging available which -- having fixed some clear errors -- seems of generally good quality. Still, to illustrate Luna's functions, in the next section we'll look at running automated staging.

SOAP versus automated staging: detecting versus correcting problems

As noted above, SOAP can be a quick, self-contained method to apply as a QC step. It is likely most useful for assessing overall quality and flagging potential problems, rather than for the kind of follow-up detective work we performed here. If there are problems with existing staging, it will likely be simpler to just redo the staging, ideally via automated methods.

Of course, if manual staging exists, one could also use automated staging as a QC check of the manual staging. Typically, in the context of automated staging, existing manual staging is assumed to be the gold standard; as we've seen here, however, that is not necessarily the case. Indeed, as automated stagers improve, low kappas on good-quality signals will more likely reflect problems with the manual staging than poor performance of the automated stager.

Still, using automated staging to evaluate existing staging may not be straightforward, as it brings an additional layer of assumptions, e.g. about the transferability of the model, as well as signal qualities. As such, SOAP can still provide a useful, simpler approach to staging evaluation.