
Step 0: project preparation

This page provides notes on obtaining the data and tools necessary for the walkthrough, as well as some general orientation.

Environments in the walkthrough

The walkthrough uses several environments:

  • the shell (i.e. a command-line or terminal, assumed to be bash) to run Luna and perform basic file manipulation tasks

  • R to analyse and visualize outputs from Luna

  • optionally, Python via JupyterLab for viewing raw data (a parallel walkthrough based entirely in Python/JupyterLab will be described elsewhere)

Commands that should be entered on the shell prompt (including Luna statements) are shown with a gray-blue background: e.g.

luna s.lst -o out.db < cmd.txt

Commands that should be executed in R are shown with a sage background: e.g.

library(luna)
k <- ldb( "out.db" )

All outputs (whether from the shell, Luna or R) are shown with a gray background: e.g.

F01.annot   F05.annot   F09.annot   M03.tsv     M07.eannot
F02.annot   F06.annot   F10.annot   M04.tsv     M08.eannot
F03.annot   F07.annot   M01.csv     M05.eannot  M09.xml
F04b.annot  F08.annot   M02.csv     M06.eannot  M10.xml


In various places throughout the walkthrough, we may refer to performing a set of commands in a particular context without always repeating the full instructions:

Term                                              Description
...in the shell... or ...on the command line...   implies using bash (e.g. the Terminal app in macOS, or Terminal in JupyterLab), e.g. to run luna or destrat directly
...in R... or ...using lunaR...                   implies opening R and running library(luna)
...using lunapi... or ...in Python...             implies opening Python (i.e. via JupyterLab) and running import lunapi as lp

Data

The example data (20 whole-night hd-EEG studies) are available from the National Sleep Research Resource (NSRR). Put in an application to access the Luna/GRINS walkthrough dataset.

Accessing these data via NSRR

There may be a delay in the tutorial files being fully available via NSRR, and so they may not be posted yet; please contact luna.remnrem@gmail.com in the meantime for advice on the timeline.


After getting access and downloading these data (and, if needed, extracting the contents of the archive), you should see a single folder named orig/ with three sub-folders: v1, v2 and aux. The files in aux/ are used at various places in the walkthrough and will be described at those points; only some key files in aux/ are listed here.

Folder                 Contents
orig/v1                Original version of the data
orig/v1/edfs           Original EDFs
orig/v1/annots         Original annotation files
orig/v2                Manipulated version of the data
orig/v2/edfs           Manipulated EDFs
orig/v2/annots         Manipulated annotation files
orig/aux/              Auxiliary datafiles used in the walkthrough
orig/aux/master.txt    Basic demographic information (age, sex)
orig/aux/amaps         Mapping file for annotations
orig/aux/cmaps         Mapping file for channels
orig/aux/badchs.txt    List of channels to impute
orig/aux/clocs         Channel location information
orig/aux/models/       Subfolder with age-prediction model files
orig/aux/pops/         Subfolder with POPS model files
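
For example, a quick check from the shell that the expected layout is present after extraction (the listing below is a sketch; exact formatting will vary by system):

ls orig
aux   v1    v2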

Project set-up

We assume all analyses occur in a working folder that contains the above orig/ folder as well as a folder named work; you can create this from the shell:

mkdir -p work/data work/harm1 work/harm2 work/clean
   ./ (current folder)
   |
   |---> orig/
   |      |---> v1/
   |      |---> v2/
   |      |---> aux/
   |
   |---> work/
          |---> data/
          |---> harm1/
          |---> harm2/
          |---> clean/
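
To confirm that the sub-folders were created:

ls work
clean   data    harm1   harm2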

As created above, work/ will contain four sub-folders, which we'll use to store different iterations of the dataset as it goes through quality control (QC):

  • data: a simple copy of the original folder orig/v2 (i.e. the manipulated original files)

  • harm1: new EDFs and annotations that are harmonized for basic properties (file format, labels, etc.), based on following step 1 of this demonstration

  • harm2: further harmonized EDFs and annotations, with changes made to give consistent sample rates, units and EEG polarities for a set of standard (ungapped) EDFs, based on following step 2 and step 3 of this demonstration

  • clean: a final, analysis-ready cleaned dataset, following epoch-level artifact correction as described in step 4 of this demonstration

To follow the walkthrough, commands should be executed in the working folder (the current folder above) that contains both orig/ and work/.

Now we'll populate the new work/data/ folder with the core elements needed from orig/v2, listing the folders/files explicitly here (this step may take half a minute or so):

cp -r orig/v2/edfs orig/v2/annots orig/aux work/data/

ls -R work/data
annots  aux  edfs

work/data/annots:
F01.annot   F05.annot   F09.annot   M03.tsv     M07.eannot
F02.annot   F06.annot   F10.annot   M04.tsv     M08.eannot
F03.annot   F07.annot   M01.csv     M05.eannot  M09.xml
F04b.annot  F08.annot   M02.csv     M06.eannot  M10.xml

work/data/aux:
amaps           female.ids      models
badchs.txt      file.txt        n106.psd.n2.proj
clocs           lm.sigs         pops
cm.sigs         male.ids        specs.json
cmaps           master.txt      step5.sh

work/data/aux/models:
m1-adult-age-data.txt       m1-adult-age-features.txt   m1-adult-age-luna.txt

work/data/aux/pops:
s2.conf     s2.priors   s2.rspec2.svd   s2.spec2a.svd
s2.ftr      s2.ranges   s2.spec1.svd    s2.spec2b.svd
s2.mod      s2.rspec1.svd   s2.spec2.svd    s2.spec2c.svd

work/data/edfs:
F01.edf F03.edf F05.edf F07.edf F09.edf M01.edf M03.edf M05.edf M07.edf M09.edf
F02.edf F04.edf F06.edf F08.edf F10.edf M02.edf M04.edf M06.edf M08.edf M10.edf
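
As a quick sanity check, there should be 20 EDFs in total, matching the listing above:

ls work/data/edfs/*.edf | wc -l
20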

Retrieving the originals

All QC steps (steps 1-4 of this demonstration) are based on the manipulated datasets (orig/v2). The analysis section (step 5) is based on the cleaned data from these steps (which is expected to reside in work/clean). Note that occasionally we'll retrieve data from the original (pre-manipulation) versions of the data (orig/v1), as needed. For example, for truncated EDFs or scrambled stage annotations, the QC process can detect that there is a problem, but naturally it cannot magically fix it. In such cases, we'll pull the originals, which you can think of as corresponding to calling the original investigator/lab that generated the data and requesting a re-export, a re-staged file, etc.
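
As a purely hypothetical sketch (the actual retrievals are described in the relevant steps, and this assumes v1 uses the same file names as v2), pulling an original EDF back is just a copy from orig/v1, e.g.:

cp orig/v1/edfs/F01.edf work/data/edfs/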

Tools

Luna

Obviously, the walkthrough requires that you have an up-to-date version of Luna available. See these installation notes to obtain Luna. For most new users, Luna as bundled in a Dockerized JupyterLab environment may be a good place to start; see below for notes on setting this up.

Look at the initial tutorial, as well as the command reference and general syntax pages as needed, to understand the usage of the luna, destrat and behead tools.
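
For example, a minimal sketch of the typical cycle, reusing the example file names from earlier on this page (s.lst and cmd.txt): run Luna writing to an output database, then list that database's contents with destrat:

luna s.lst -o out.db < cmd.txt
destrat out.db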

Using JupyterLab

Note that you can use JupyterLab to perform both the command-line/R variant of the walkthrough and the Python-based variant. (There appear to be some minor issues in controlling plot sizes from R when using JupyterLab.)

Note for Windows users

We suggest using Docker and the JupyterLab environment, which provides a full shell and text editor, and comes with all necessary software bundled, including luna compiled as an R library and the Python-based lunapi.

Shell environment

It is important to have basic familiarity with a shell environment, such as bash. There is a wealth of easily accessible material online to guide you here, if needed. You should be able to:

  • apply basic command line operations (e.g. moving files, changing directories)
  • understand the basics of shell redirection and piping, and shell variables
  • use basic awk commands (and potentially regular expressions in grep and sed)

The walkthrough does not demand particularly deep knowledge of these things (which are worth learning in any case).

Shell orientation

If you are not familiar with the shell environment, or are unsure, you may want to review the shell orientation section below, which covers the core set of competencies that will be useful when working on the command line with Luna (or other data-oriented command-line tools).

Text-editor

Find a command-line text editor that you are happy with. If using JupyterLab, you can use the built-in text editor. Otherwise, popular choices include Micro, Atom and Visual Studio Code, as well as classic tools such as nano, pico, vim or emacs.

Docker

Perhaps the most straightforward way to follow this walkthrough is to use the lunapi Docker container. This includes the command-line version of Luna as well as the R and Python packages, in a platform-agnostic manner. Further, it provides a terminal/console and a text editor that can be used to follow the walkthrough, either via the command-line path or primarily using the Python-based lunapi package.

Follow these steps to obtain and start a Dockerized version of the Luna tools:

  • if not already present, install Docker Desktop on your local machine

  • once installed, obtain the Docker image:

    docker pull remnrem/lunapi
    

  • start the container

    docker run --rm -p 8889:8888 -v ${PWD}:/lunapi/ remnrem/lunapi start-notebook.py --NotebookApp.token='abc'
    

  • visit 127.0.0.1:8889 in your browser, and enter the token abc

  • open a Terminal and a parallel Notebook with an R kernel (and optionally one with a Python 3 kernel as well); you should be able to follow all steps by toggling between those tabs within JupyterLab (with these pages open in another browser tab, of course)
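
Note that the -v ${PWD}:/lunapi/ option mounts your current folder inside the container, so it helps to launch Docker from the project working folder, making orig/ and work/ visible within JupyterLab (the path below is hypothetical):

cd ~/grins-walkthrough   # i.e. wherever the folder containing orig/ and work/ resides
docker run --rm -p 8889:8888 -v ${PWD}:/lunapi/ remnrem/lunapi start-notebook.py --NotebookApp.token='abc'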

Shell orientation

More involved material: can be skipped initially

This section is more involved and can probably be skipped before diving in... but it is worth returning to, especially since how the shell (versus other tools) handles variables is a common source of confusion.

If you can (more or less) follow the logic of the steps below, you'll be in good shape. If you can't, try some of the resources linked below to figure out the answers.

Running these commands

You can just look at the commands below to check you understand them. If you want to actually execute them too, the file file.txt is in the original walkthrough folders (orig/aux/file.txt). Alternatively, you can make one yourself with a text editor (it should use tabs to separate columns for the steps to play out as below).

Make a temporary folder

mkdir tmp1
Move into it
cd tmp1
Copy a pre-existing file from a different folder (orig/aux/) in the parent folder (../) of this current folder (.)
cp ../orig/aux/file.txt .
Confirm that file.txt now exists within the current (tmp1) folder
ls 
file.txt
Display the contents of the text file with cat
cat file.txt
Col1  Col2  Col3
A     N     Row one
B     Y     Row two
C     N     Row three
D     Y     Row four
Make a copy of this file in the current folder called copy.txt
cp file.txt copy.txt
But then decide to change its name to file2.txt
mv copy.txt file2.txt
Concatenate the two (identical) files and redirect (>) the output to file3.txt
cat file.txt file2.txt > file3.txt
Check the contents of file3.txt
cat file3.txt
Col1  Col2  Col3
A     N     Row one
B     Y     Row two
C     N     Row three
D     Y     Row four
Col1  Col2  Col3
A     N     Row one
B     Y     Row two
C     N     Row three
D     Y     Row four
Look at only the first N rows of a file (e.g. 3) with head
head -3 file3.txt
Col1  Col2  Col3
A     N     Row one
B     Y     Row two
Extract only the second column of a file (assuming the fields in the files are tab-delimited)
cut -f2 file3.txt
Col2
N
Y
N
Y
Col2
N
Y
N
Y
Count the values in the second column by piping (|) the output of cut into sort (to order all rows) and then into uniq -c, which retains only unique rows (by merging identical adjacent rows) and counts the number of rows merged
cut -f2 file3.txt | sort | uniq -c
   2 Col2
   4 N
   4 Y
We can break up the above multi-step command to check we understand it:
cut -f2 file3.txt | sort
Col2
Col2
N
N
N
N
Y
Y
Y
Y
Then send the output of sort to the input of uniq, but without the count (-c) argument
cut -f2 file3.txt | sort | uniq 
Col2
N
Y
And to do the same but allowing the commands to span multiple lines, using the \ character as the final character per line
cut -f2 file3.txt \
   | sort \
   | uniq
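Count the number of lines in a file with wc -l (file3.txt should contain ten rows, i.e. the two concatenated five-row files)
wc -l file3.txt
10 file3.txt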

Look at rows from the original file.txt after the first (header) row using awk (where NR means the row number)

awk ' NR > 1 ' file.txt
A     N     Row one
B     Y     Row two
C     N     Row three
D     Y     Row four
To repeat but extracting only columns 1 and 2 (noting the condition { action } form of full awk statements, and that awk processes a text file row by row)
awk ' NR > 1 { print $1 , $2 } ' file.txt
A     N
B     Y
C     N
D     Y
To print entries from column 3 if column 2 has a Y value
awk ' $2 == "Y" { print $3 } ' file.txt
Row
Row
To see that (unlike cut) awk by default delimits columns on whitespace (tabs or spaces), where NF is the number of fields (columns) in each row
awk ' { print NF } ' file.txt
3
4
4
4
4
To explicitly request tab (\t) delimiters only, with -F
awk -F"\t" ' { print NF } ' file.txt
3
3
3
3
3
To repeat the above (print entries from column 3 if column 2 has a Y value) now with the correct tab delimiters
awk -F"\t" ' $2 == "Y" { print $3 } ' file.txt
Row two
Row four
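The same kind of row filtering can be done with grep (here a simple literal match, rather than a full regular expression), e.g. to show only the rows containing a Y
grep Y file.txt
B     Y     Row two
D     Y     Row four
Similarly, sed performs simple substitutions, e.g. to replace the first N on each line with No
sed 's/N/No/' file.txt
Col1  Col2  Col3
A     No    Row one
B     Y     Row two
C     No    Row three
D     Y     Row four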

(More involved!) To use awk variables, conditionals and multi-step commands, e.g. to print column 2 for odd rows but column 3 for even rows (where % is the modulus operator), additionally setting the output to be tab-delimited using the output field separator (OFS) option.

awk -F"\t" ' { even = NR % 2 == 0; c = even ? 3 : 2; l = even ? "Even" : "Odd" }
             { print l , $c } ' OFS="\t" file.txt

Repeating the above, noting that awk (and Luna) commands using '-quotes can span multiple lines without \ characters (and with different {} and ; formatting here, too)

awk -F"\t" ' {
               even = NR % 2 == 0
               c = even ? 3 : 2
               l = even ? "Even" : "Odd" 
               print l , $c
             } ' OFS="\t" file.txt       

To use a shell variable to specify, e.g., a file name

f="file2.txt"
cat ${f}
Col1  Col2  Col3
A     N     Row one
B     Y     Row two
C     N     Row three
D     Y     Row four
To pass a shell variable to awk and understand the difference between $j, j and ${j} below
j=2
awk -F"\t" ' { print $j } ' j=${j} file.txt
Col2
N
Y
N
Y


Variables

Obviously there is much more to learn, and awk itself has a large number of options and more involved syntax, but if you can acquire enough familiarity to be able to follow the logic and note the (sometimes subtle) syntactic differences between the commands above, you'll be in good shape. Don't be too put off by what might at first appear to be idiosyncratic convention and fussy syntax. The good news is that a little knowledge goes a long way (and is not at all dangerous...!).

For example, in the final example above, we first set a shell variable called j to the value 2; when invoking awk, we reference the shell variable (by ${j}, although $j would also work here) and assign it to a variable that awk will understand when processing file.txt. We happen to give it the same label here (j), but it could be anything, e.g. x=${j}.

Then, when parsing the text inside the single-quoted region ('), which is what awk takes as its instructions, awk will have access to that variable. Inside awk, the $ sign means a column number rather than pointing to a variable (as it does in bash). Thus print $j is actually interpreted as print $2, which means print the second column, which is what we see above.

It can be handy to play around with the syntax, i.e. this would give the same output, as noted above:

awk -F"\t" ' { print $x } ' x=${j} file.txt
whereas this would print something different (can you guess what?):
awk -F"\t" ' { print x } ' x=${j} file.txt
and, as a further example, the statement below would give a syntax error from awk: inside single quotes bash does not expand ${x}, and awk itself does not understand the shell-style ${x} syntax
awk -F"\t" ' { print ${x} } ' x=${j} file.txt


We've labored this distinction about passing variables from the shell into a second program because it often arises when using Luna on the command line, and it can be a source of confusion for beginners:

luna s.lst s=${s} -o out.db -s ' PSD sig=${s} dB spectrum max=25 '
The first s=${s} passes the shell variable to Luna, i.e. if s was set to C4 it would be identical to writing:
luna s.lst s=C4 -o out.db -s ' PSD sig=${s} dB spectrum max=25 '

Then, within the Luna script (which uses single-quote delimiters like awk), the ${s} references the Luna variable s (not the bash variable); also note that, unlike awk, Luna uses the same ${x} syntax for variables as the shell does. The key point is that ${s} refers to something conceptually different within the Luna script (which could be read from a file) compared to the shell variable, despite the fact that in this particular example they are set to the same value (and nominally have the same label too).

Assuming the shell variable s is set to C4, it is important to understand these differences, which reflect the way that the shell and Luna work:


Works as above: but passes shell variable s to Luna variable x:

luna s.lst x=${s} -o out.db -s ' PSD sig=${x} dB spectrum max=25 '


Works as above: but uses shell variable s directly in the command-line Luna script (note double-quotes ") rather than explicitly assigning it:

luna s.lst -o out.db -s " PSD sig=${s} dB spectrum max=25 "


Would not work as above: within single quotes ${s} is not expanded as a shell variable, but rather is passed to Luna literally as ${s} (rather than the text C4); Luna will then look for a variable s but will complain with an error message as it hasn't been defined:

luna s.lst -o out.db -s ' PSD sig=${s} dB spectrum max=25 '


Finally, the statement below would not work as above: although we are telling Luna to define a variable x (which would be set to the value C4), because the command line is in double quotes the shell will try to expand ${x} as a shell variable before Luna sees the command; as x is not defined as a shell variable, and the shell expands undefined variables to an empty string, what Luna actually sees is PSD sig= dB spectrum max=25:

luna s.lst -o out.db x=${s} -s " PSD sig=${x} dB spectrum max=25 "


Key point

The key point is to be aware of which tool is doing the interpretation of the variable: the shell (bash), or a second tool such as luna or awk. The context and syntax can be used to control these things.
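
A quick way to see exactly what the shell hands to a downstream tool is to echo the script text under each quoting style (here assuming we first set s=C4 in the shell):

s=C4
echo ' PSD sig=${s} dB spectrum max=25 '
 PSD sig=${s} dB spectrum max=25
echo " PSD sig=${s} dB spectrum max=25 "
 PSD sig=C4 dB spectrum max=25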

More information on shell scripting

For more information on shell scripting, you might consider the following types of resources, which look reasonable at first glance (but certainly contain much, much more information than is required to get started using Luna, so don't feel you need to read them all...):


Finally, to remove this tmp1 working folder and all the files therein:

cd ..
rm -rf tmp1

Hopefully, this shell orientation was helpful: now let's proceed to initial data QC.