1.6. Reviewing and harmonizing annotation labels

We now have a set of EDFs in work/harm1 that should all have similar labels, units, sampling rates and referencing schemes (we'll check this later). Next, we turn to harmonizing the annotations.

Tabulating annotations

Although Luna internally treats all annotations similarly once loaded, it can be advantageous to have similar file formats across different annotation files (e.g. if you want to load them into other software). We use the standard .annot format, so one step will involve reading then writing annotations to this format in a uniform manner.

We also need to handle the arbitrary differences in the stage labels that different studies use, adopting an approach conceptually similar to the aliasing done for channel labels.

We can look at the list of annotations by using the ANNOTS command:

luna s1.lst -o out.db -s ANNOTS

destrat out.db +ANNOTS -r ANNOT | head

 ID    ANNOT  COUNT     DUR
F01  Arousal     26   255.2
F01       N2    361   10830
F02  Arousal     97  1096.6
F02       N2    431   12930
F02       N1     36    1080
F02       N3    145    4350
F02        R    180    5400
F02        W     69    2070
F03       N2     40    6720
...

The following extracts the class labels (column 2) from all 20 individuals and counts how many individuals have each one (the output below has been reordered to group related labels):

destrat out.db +ANNOTS -r ANNOT | cut -f2 | sort | uniq -c

  15 N1
  16 N2
  15 N3
  13 R
  13 W

   2 0
   2 1
   2 2
   2 3
   2 5

   2 SR
   2 SW

   2 SlpStg1
   2 SlpStg2
   2 SlpStg3
   2 SlpStgREM
   2 SlpStgWake

  10 Arousal

That is, the most common forms are N1, N2, N3, R and W (the last two being REM and wake respectively), each seen in more than half of the individuals. We then see some alternate forms, e.g. SlpStg1, which is equivalent to N1. We also see a commonly used numeric encoding of sleep stage: 1, 2 and 3 for N1, N2 and N3, 5 for REM, and 0 for wake.
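As a minimal sketch of why these counts are numbers of individuals rather than numbers of annotation instances: the ANNOTS table has one row per individual and annotation class, so counting rows per label counts individuals. The toy rows and the helper names (toy_annots, count_labels) below are invented for illustration and only mimic the tab-delimited layout shown above.

```shell
# Toy stand-in for the tab-delimited ANNOTS table (all values are invented).
# Columns: ID, ANNOT, COUNT, DUR.
toy_annots() {
  printf 'ID\tANNOT\tCOUNT\tDUR\n'
  printf 'X01\tN2\t361\t10830\n'
  printf 'X02\tN2\t431\t12930\n'
  printf 'X02\tW\t69\t2070\n'
}

# Skip the header row, keep column 2, and count how often each label occurs.
# Because each ID/label pair appears once, this is the number of individuals.
count_labels() {
  tail -n +2 | cut -f2 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

toy_annots | count_labels
```

Here N2 is reported twice (both toy individuals) and W once, regardless of the per-individual COUNT column.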

When we look at the individual annotation files, however, we might notice that some labels are missing from this list. For example, M05 has lower-case stage annotations in an .eannot file:

head work/data/annots/M05.eannot

n2
n2
n3
w
n1
w
n2
n2

We also see that F08.annot uses S1, S2 and S3 encodings:

...
S2      .       .       01:26:30        01:27:00        .
S2      .       .       01:27:00        01:27:30        .
SW      .       .       01:27:30        01:28:00        .
Arousal .       .       01:27:32.5      01:27:49.7      .
S2      .       .       01:28:00        01:28:30        .
S2      .       .       01:28:30        01:29:00        .
S2      .       .       01:29:00        01:29:30        .
S2      .       .       01:29:30        01:30:00        .
S2      .       .       01:30:00        01:30:30        .
SR      .       .       01:30:30        01:31:00        .
SR      .       .       01:31:00        01:31:30        .
SR      .       .       01:31:30        01:32:00        .
SR      .       .       01:32:00        01:32:30        .
SR      .       .       01:32:30        01:33:00        .
SR      .       .       01:33:00        01:33:30        .
...

Automatic stage remapping

What happened to those other stage annotations above (e.g. S1)? This reflects a default feature of Luna, which maps some commonly encountered terms to the standard stage labels N1, N2, N3, R and W. The mapping is hard-coded and so naturally cannot anticipate all possible terms (which is why SW, SlpStg1, etc. are not mapped). For example, the current set of terms that are automatically mapped to N1 includes:

 NREM1
 Stage1
 S1
 Stage 1 sleep|1
 SRO:Stage1Sleep
 SDO:NonRapidEyeMovementSleep-N1

These are largely based on terms encountered across various NSRR studies. Mappings are case-insensitive too, which is why n1 is mapped to N1 internally.
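The idea of a case-insensitive lookup can be sketched as follows. This is an illustration only, not Luna's actual implementation: the helper name canon_stage is hypothetical, and the table holds just three of the N1 alternates listed above.

```shell
# Illustrative only: a tiny case-insensitive lookup, not Luna's own code.
canon_stage() {
  awk '
    BEGIN {
      # Keys are stored upper-cased; values are the standard label.
      map["N1"] = "N1"; map["NREM1"] = "N1"; map["STAGE1"] = "N1"; map["S1"] = "N1"
    }
    {
      key = toupper($0)                        # compare on an upper-cased copy
      print ((key in map) ? map[key] : $0)     # unknown labels pass through unchanged
    }'
}

printf 'n1\nS1\nSlpStg1\n' | canon_stage
```

In this sketch n1 and S1 both come out as N1, whereas SlpStg1 is left alone because it is not in the table, mirroring the behavior described above.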

You can turn off this behavior by setting the annot-remap=F special variable (again, the output below has been sorted to group things logically):

luna s1.lst -o out.db annot-remap=F -s ANNOTS

destrat out.db +ANNOTS -r ANNOT | cut -f2 | sort | uniq -c

   8 N1   
   6 S1
   2 SlpStg1
   2 1
   1 n1

   9 N2   
   6 S2   
   2 SlpStg2
   2 2
   1 n2

   8 N3
   6 S3
   2 SlpStg3
   2 3
   1 n3

   8 R
   4 REM
   2 SR
   2 SlpStgREM
   2 5
   1 r

   8 W
   2 SW
   2 SlpStgWake
   4 Wake
   2 0
   1 w

Now we see the original labels (i.e. identical to the input files), which makes the structure clearer.

Arousals and multiple annotation files

Note that some individuals also have an Arousal annotation, denoting manually scored arousals. These annotations are lost in the .eannot format, however, as it only accepts epoch-level codes. Multiple annotation files (potentially of different formats) can be associated with the same EDF, though, so using .eannot to represent staging doesn't preclude including other information too.
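To make the limitation concrete, here is a small sketch of the intervals implied by an epoch-level file: one label per line, each covering exactly one epoch. The 30-second epoch length is an assumption (it matches the 30-second stage intervals in the F08.annot excerpt above), and the helper name eannot_intervals is hypothetical.

```shell
# Hypothetical helper: for each line of an epoch-level file, print the label with
# its implied start and stop times in seconds from the start of the recording.
# The epoch length defaults to 30 seconds, which is an assumption.
eannot_intervals() {
  awk -v len="${1:-30}" '{ print $1 "\t" ((NR - 1) * len) "\t" (NR * len) }'
}

printf 'n2\nn2\nw\n' | eannot_intervals
```

Every interval starts and stops on an epoch boundary, so an event such as the arousal scored from 01:27:32.5 to 01:27:49.7 has no place in this layout.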

New annotation files

Next, we'll generate a mapping file to make all annotations consistent across individuals. Although some (e.g. S1) are mapped automatically, we'll include those terms here for reference anyway; we won't add the lower-case variants, however, as annotation labels are matched case-insensitively.

Using the same primary|alt1|alt2|... form, we'll make a two-column tab-delimited file (which should already exist in the demonstration folder: work/data/auxiliary/amaps) to read as follows:

remap   N1|1|S1|SlpStg1
remap   N2|2|S2|SlpStg2
remap   N3|3|S3|SlpStg3
remap   R|REM|5|SR|SlpStgREM
remap   W|Wake|0|SW|SlpStgWake

The remap special option is the equivalent of alias, but for annotation labels. (Note: if an alternate annotation itself contains a | character, you need to quote (") the entire annotation, as pipes are used as delimiters in the above.)
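The primary|alt1|alt2|... layout can be inspected with standard tools. The sketch below writes two of the lines above to a temporary file (tab-delimited, as required) and lists each alternate next to its primary label. The helper name expand_remaps is hypothetical, and as a simplification it does not handle quoted alternates that themselves contain a | character.

```shell
# Write a small two-column, tab-delimited mapping file to a temporary location
# (the real file in this walkthrough is work/data/auxiliary/amaps).
tmp_amaps="$(mktemp)"
printf 'remap\tN1|1|S1|SlpStg1\n'        >  "$tmp_amaps"
printf 'remap\tW|Wake|0|SW|SlpStgWake\n' >> "$tmp_amaps"

# Hypothetical helper: print "alternate<TAB>primary" for every remap line.
# Simplification: quoted alternates containing "|" are not handled.
expand_remaps() {
  awk -F'\t' '$1 == "remap" {
    n = split($2, term, "|")          # term[1] is the primary, the rest are alternates
    for (i = 2; i <= n; i++) print term[i] "\t" term[1]
  }' "$1"
}

expand_remaps "$tmp_amaps"
rm -f "$tmp_amaps"
```

Listing the pairs this way is a quick check that no alternate has been assigned to the wrong primary label before the file is passed to Luna.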

We can check this works as expected (ignoring the Arousal annotation in the output here). First we re-run ANNOTS but including the remapping definitions from amaps:

luna s1.lst -o out.db @work/data/auxiliary/amaps -s ANNOTS

and then we extract, tabulate and count the class labels across all 20 individuals:

destrat out.db +ANNOTS -r ANNOT | cut -f2 | sort | uniq -c

The resulting counts show only five distinct stage annotation labels: whereas the N2 label is present in all 20 individuals, the other labels are present in only 19 of the 20. For typical whole-night PSGs, this would presumably be strange: one would expect at least some wake and other stages, including REM. As we'll see later, this in fact reflects one of the manipulations of this dataset (for F01).
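One way to find out which individual lacks a given label is sketched below. The helper name ids_missing and the toy rows are invented for illustration; the helper expects two tab-delimited columns (ID and label) with no header row.

```shell
# Hypothetical helper: read ID<TAB>ANNOT rows on stdin and print the IDs that
# never carry the label given as the first argument.
ids_missing() {
  awk -F'\t' -v want="$1" '
    { seen[$1] = 1 }                 # every ID encountered
    $2 == want { has[$1] = 1 }       # IDs that carry the wanted label
    END { for (id in seen) if (!(id in has)) print id }
  ' | sort
}

# Invented toy rows: X01 has only N2, whereas X02 has both N2 and W.
printf 'X01\tN2\nX02\tN2\nX02\tW\n' | ids_missing W
```

On the toy rows this prints X01. With real output, the helper could presumably be fed from destrat out.db +ANNOTS -r ANNOT | tail -n +2 | cut -f1,2, which drops the header row and keeps the ID and label columns.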

We can now make a set of new reformatted and remapped annotation files, which we'll write to the same work/harm1 folder as the EDFs. We do this by combining the remapping step (the remap terms @-included from amaps) with the WRITE-ANNOTS command, which ensures that a consistent (.annot) format is applied across recordings:

luna s1.lst @work/data/auxiliary/amaps -o out.db -s ' WRITE-ANNOTS file=work/harm1/^.annot '

(As a reminder, the ^ character swaps in the ID of that dataset; as some shells (e.g. zsh) interpret it as a special character, we've put the Luna command within single quotes, which stops the shell from interpreting it. More details can be found in the Luna documentation.)

At this stage, we've now populated the work/harm1/ folder with 20 new EDFs and 20 new .annot annotation files, and are ready to generate a new sample list for this project next.
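A quick sanity check on the folder contents might look like the sketch below. The helper name count_ext is hypothetical, and the .edf extension used for the EDFs is an assumption; the demonstration runs on a throwaway folder rather than on work/harm1 itself.

```shell
# Hypothetical helper: count files with a given extension directly inside a folder.
count_ext() {
  find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Demonstrate on a throwaway folder with one .edf and two .annot files.
demo="$(mktemp -d)"
touch "$demo/a.edf" "$demo/a.annot" "$demo/b.annot"
count_ext "$demo" annot
rm -rf "$demo"
```

Run against the real folder (e.g. count_ext work/harm1 annot), the expectation from the text above would be 20 for the annotation files, and likewise 20 for the EDFs.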