Step 0: project preparation
This page provides notes on obtaining the data and tools necessary for the walkthrough, as well as some general orientation.
Environments in the walkthrough
The walkthrough uses several environments:
- the shell (i.e. a command line or terminal, assumed to be `bash`) to run Luna and perform basic file manipulation tasks
- R, to analyse and visualize outputs from Luna
- optionally, Python via JupyterLab for viewing raw data (a parallel walkthrough based entirely in Python/JupyterLab will be described elsewhere)
Commands that should be entered on the shell prompt (including Luna statements) are shown with a gray-blue background: e.g.
```
luna s.lst -o out.db < cmd.txt
```
Commands that should be executed in R are shown with a sage background: e.g.
```
library(luna)
k <- ldb( "out.db" )
```
All outputs (whether from the shell, Luna or R) are shown with a gray background: e.g.
```
F01.annot F05.annot F09.annot M03.tsv M07.eannot
F02.annot F06.annot F10.annot M04.tsv M08.eannot
F03.annot F07.annot M01.csv M05.eannot M09.xml
F04b.annot F08.annot M02.csv M06.eannot M10.xml
```
In various places throughout the walkthrough, we may refer to performing a set of commands in a particular context but not always repeat the full instructions:
Term | Description |
---|---|
_...in the shell..._ or _...on the command line..._ | implies using `bash` (e.g. the Terminal app in macOS, or Terminal in JupyterLab), e.g. to run `luna` or `destrat` directly |
_...in R..._ or _...using lunaR..._ | implies opening R and running `library(luna)` |
_...using lunapi..._ or _...in Python..._ | implies opening Python (i.e. via JupyterLab) and running `import lunapi as lp` |
Data
The example data (20 whole-night hd-EEG studies) are available from the National Sleep Research Resource (NSRR); put in an application to access the Luna/GRINS walkthrough dataset.
Accessing these data via NSRR
There may be a delay in the tutorial files being fully available via NSRR, and so they may not be posted yet; in the meantime, please contact luna.remnrem@gmail.com for advice on the timeline.
After getting access and downloading these data (and potentially extracting the contents of the archive), you should see a single folder entitled `orig/`, with three sub-folders: `v1`, `v2` and `aux`.
The files in `aux/` are used at various places in the walkthrough and will be described at that point. Not all files in `aux` are listed here, only some key ones.
Folder | Contents |
---|---|
`orig/v1` | Original version of the data |
`orig/v1/edfs` | Original EDFs |
`orig/v1/annots` | Original annotation files |
`orig/v2` | Manipulated version of the data |
`orig/v2/edfs` | Manipulated EDFs |
`orig/v2/annots` | Manipulated annotation files |
`orig/aux/` | Auxiliary datafiles used in the walkthrough |
`orig/aux/master.txt` | Basic demographic information (age, sex) |
`orig/aux/amaps` | Mapping file for annotations |
`orig/aux/cmaps` | Mapping file for channels |
`orig/aux/badchs.txt` | List of channels to impute |
`orig/aux/clocs` | Channel location information |
`orig/aux/models/` | Subfolder with age-prediction model files |
`orig/aux/pops/` | Subfolder with POPS model files |
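As a quick sanity check (a minimal sketch, assuming the archive was extracted into the current folder), listing `orig/` should show just the three sub-folders:

```
ls orig        # expect: aux  v1  v2
```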
Project set-up
We assume all analyses occur in a working folder that contains the above `orig/` folder as well as a folder named `work`; you can create the latter from the shell:

```
mkdir -p work/data work/harm1 work/harm2 work/clean
```
```
./ (current folder)
|
|---> orig/
|      |---> v1/
|      |---> v2/
|      |---> aux/
|
|---> work/
       |---> data/
       |---> harm1/
       |---> harm2/
       |---> clean/
```
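Equivalently (an optional shorthand, assuming `bash`), brace expansion creates the same four sub-folders in a single shorter command:

```
# identical effect to the mkdir command above
mkdir -p work/{data,harm1,harm2,clean}
```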
As created above, `work/` will contain four sub-folders, which we'll use to store different iterations of the dataset as it goes through quality control (QC):
- `data` : a simple copy of the original folder `orig/v2` (i.e. the manipulated original files)
- `harm1` : newly created EDFs and annotations that are harmonized for basic properties (file format, harmonized labels, etc.), based on following step 1 of this demonstration
- `harm2` : further harmonized EDFs and annotations, with changes made to give consistent sample rates, units and EEG polarities for a set of standard (ungapped) EDFs, based on following step 2 and step 3 of this demonstration
- `clean` : a final, analysis-ready cleaned dataset, following epoch-level artifact correction as described in step 4 of this demonstration
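To confirm the set-up at this point (a minimal check; `ls` lists entries alphabetically):

```
ls work        # expect: clean  data  harm1  harm2
```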
To follow the walkthrough, commands should be executed in the working folder (the current folder above) that contains both `orig/` and `work/`.
Now we'll populate the new `work/data/` folder with the core elements needed from `orig/v2`, here listing out the folders/files explicitly (this step may take half a minute or so):

```
cp -r orig/v2/edfs orig/v2/annots orig/aux work/data/
ls -R work/data
```
```
annots aux edfs

work/data/annots:
F01.annot F05.annot F09.annot M03.tsv M07.eannot
F02.annot F06.annot F10.annot M04.tsv M08.eannot
F03.annot F07.annot M01.csv M05.eannot M09.xml
F04b.annot F08.annot M02.csv M06.eannot M10.xml

work/data/aux:
amaps female.ids models
badchs.txt file.txt n106.psd.n2.proj
clocs lm.sigs pops
cm.sigs male.ids specs.json
cmaps master.txt step5.sh

work/data/aux/models:
m1-adult-age-data.txt m1-adult-age-features.txt m1-adult-age-luna.txt

work/data/aux/pops:
s2.conf s2.priors s2.rspec2.svd s2.spec2a.svd
s2.ftr s2.ranges s2.spec1.svd s2.spec2b.svd
s2.mod s2.rspec1.svd s2.spec2.svd s2.spec2c.svd

work/data/edfs:
F01.edf F03.edf F05.edf F07.edf F09.edf M01.edf M03.edf M05.edf M07.edf M09.edf
F02.edf F04.edf F06.edf F08.edf F10.edf M02.edf M04.edf M06.edf M08.edf M10.edf
```
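As a further check (a minimal sketch; `wc -l` counts the lines piped from `ls`), there should be 20 EDFs in total:

```
ls work/data/edfs | wc -l     # expect: 20
```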
Retrieving the originals
All QC steps (steps 1-4) of this demonstration are based on the manipulated datasets (`orig/v2`). The analysis section (step 5) is based on the cleaned data from these steps (which is expected to reside in `work/clean`). Note that occasionally we'll retrieve data from the original (pre-manipulation) versions of the data (`orig/v1`), as needed. For example, for truncated EDFs or scrambled stage annotations, the QC process can detect that there is a problem, but naturally it cannot magically fix those problems. In such cases, we'll pull the originals, which you can think of as corresponding to calling the original investigator/lab that generated the data and requesting a re-export, a re-staged file, etc.
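For example (a hypothetical sketch: `F01.edf` here simply stands in for whichever recording turns out to be problematic), pulling an original EDF back over its manipulated copy might look like:

```
# replace the manipulated EDF with the original export (hypothetical file name)
cp orig/v1/edfs/F01.edf work/data/edfs/F01.edf
```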
Tools
Luna
Obviously, the walkthrough requires that you have an up-to-date version of Luna available. See these installation notes to obtain Luna. For most new users, Luna as bundled in a Dockerized JupyterLab environment may be a good place to start; see below for notes on setting this up.
Look at the initial tutorial as well as the command reference and general syntax pages as needed, to understand usage of the `luna`, `destrat` and `behead` tools.
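As a flavor of how these three tools fit together (a minimal sketch; `HEADERS` is a standard Luna command, but the sample-list and database names here just follow this walkthrough's conventions):

```
# run a Luna command over a sample list, writing output to a database
luna s.lst -o out.db -s ' HEADERS '

# list the strata available in the output database
destrat out.db

# extract the channel-level output of the HEADERS command
destrat out.db +HEADERS -r CH

# rotate the same output into a more readable long format
destrat out.db +HEADERS -r CH | behead
```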
Using JupyterLab
Note that you can use JupyterLab to perform both the command-line/R variant of the walkthrough and the Python-based variant. (There appear to be some minor issues in controlling plot sizes from R when using JupyterLab.)
Note for Windows users
We suggest using Docker and the JupyterLab environment, which provides a full shell and text editor, and comes with all necessary software bundled, including Luna compiled as an R library and the Python-based lunapi.
Shell environment
It is important to have basic familiarity with a shell environment, such as `bash`. There is a wealth of easily accessible material online to guide you here, if needed. You should be able to:

- apply basic command-line operations (e.g. moving files, changing directories)
- understand the basics of shell redirection, piping and shell variables
- use basic `awk` commands (and potentially regular expressions in `grep` and `sed`)
The walkthrough does not demand particularly deep knowledge of these things (which are worth learning in any case).
Shell orientation
If you are not familiar with the shell environment, or are unsure, work through the shell orientation section below, which reviews the core competencies useful for working on the command line with Luna (or other data-oriented command-line tools).
Text-editor
Find a command-line text editor that you are happy with. If using JupyterLab, you can use the built-in text editor. Otherwise, popular choices include Micro, Atom and Visual Studio Code, as well as classic tools such as nano, pico, vim or emacs.
Docker
Perhaps the most straightforward way to follow this walkthrough is to use the `lunapi` Docker container. This includes the command-line version of Luna as well as the R and Python packages in a platform-agnostic manner. Further, it provides a terminal/console and a text editor that can be used to follow the walkthrough, either via the command-line path or primarily using the Python-based `lunapi` package.
Follow these steps to obtain and start a Dockerized version of the Luna tools:
- if not already present, install Docker Desktop on your local machine
- once installed, obtain the Docker image:

  ```
  docker pull remnrem/lunapi
  ```

- start the container:

  ```
  docker run --rm -p 8889:8888 -v ${PWD}:/lunapi/ remnrem/lunapi start-notebook.py --NotebookApp.token='abc'
  ```

- visit 127.0.0.1:8889 in your browser, and enter the token `abc`
- open a Terminal and a parallel Notebook with an R kernel (and optionally, one with Python 3 as well); you should be able to follow all steps by toggling between those tabs within JupyterLab (with these pages open in another browser window, of course)
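Once inside the container's Terminal, a quick check that the bundled tools are on the path (a minimal sketch; `luna -v` prints version/build information):

```
luna -v                  # print Luna version/build information
which destrat behead     # confirm the companion tools are installed
```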
Shell orientation
More involved material: can be skipped initially
This section is more involved and can probably be skipped before diving in... but it is worth returning to, especially as how the shell (versus other tools) handles variables is a common source of confusion.
If you can (more or less) follow the logic of the steps below, you'll be in good shape. If you can't, consult one of the many online shell tutorials to figure out the answers.
Running these commands
You can just look at the commands below to check you understand them. If you want to actually execute them too, the file `file.txt` is in the original walkthrough folders (`orig/aux/file.txt`). Alternatively, you can make one yourself with a text editor (it should use tabs to separate columns for the steps to play out as below).
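If you'd rather not use an editor, a one-liner such as this creates an identical file (a minimal sketch; `printf` turns the `\t` and `\n` escapes into real tabs and newlines):

```
# write the four-row example table, tab-delimited, to file.txt
printf 'Col1\tCol2\tCol3\nA\tN\tRow one\nB\tY\tRow two\nC\tN\tRow three\nD\tY\tRow four\n' > file.txt
```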
Make a temporary folder and move into it:

```
mkdir tmp1
cd tmp1
```

Copy `file.txt` from the walkthrough folder (`orig/aux/`) in the parent folder (`../`) of this current folder (`.`):

```
cp ../orig/aux/file.txt .
```

Check that `file.txt` now exists within the current (`tmp1`) folder:

```
ls
```
```
file.txt
```
View the file contents with `cat`:

```
cat file.txt
```
```
Col1 Col2 Col3
A N Row one
B Y Row two
C N Row three
D Y Row four
```
Copy the file to `copy.txt`:

```
cp file.txt copy.txt
```

Rename (i.e. move) it to `file2.txt`:

```
mv copy.txt file2.txt
```
Concatenate both files, redirecting (`>`) the output to `file3.txt`:

```
cat file.txt file2.txt > file3.txt
```

View the contents of `file3.txt`:

```
cat file3.txt
```
```
Col1 Col2 Col3
A N Row one
B Y Row two
C N Row three
D Y Row four
Col1 Col2 Col3
A N Row one
B Y Row two
C N Row three
D Y Row four
```
View only the first three lines with `head`:

```
head -3 file3.txt
```
```
Col1 Col2 Col3
A N Row one
B Y Row two
```
Extract the second (tab-delimited) column with `cut` (note that the header and data values appear twice, as `file3.txt` contains two copies of the original file):

```
cut -f2 file3.txt
```
```
Col2
N
Y
N
Y
Col2
N
Y
N
Y
```
Pipe (`|`) the output of `cut` into `sort` (to order all rows) and then into `uniq -c`, to retain only unique rows (by merging identical adjacent rows) while counting the number of merged rows:

```
cut -f2 file3.txt | sort | uniq -c
```
```
2 Col2
4 N
4 Y
```
To see the intermediate step, run only the first part of the pipeline:

```
cut -f2 file3.txt | sort
```
```
Col2
Col2
N
N
N
N
Y
Y
Y
Y
```
Feed the output of `sort` into `uniq`, but without the count (`-c`) argument:

```
cut -f2 file3.txt | sort | uniq
```
```
Col2
N
Y
```
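As an optional aside (a small extension of the same idea, not needed for the walkthrough): adding a final reverse numeric sort gives the classic command-line frequency table, with the most common values first:

```
# count values in column 2, most frequent first
cut -f2 file3.txt | sort | uniq -c | sort -rn
```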
Note that long commands can be split over multiple lines by using the `\` character as the final character per line:

```
cut -f2 file3.txt \
  | sort \
  | uniq
```
Look at rows from the original `file.txt` after the first (header) row using `awk` (where `NR` means row number):

```
awk ' NR > 1 ' file.txt
```
```
A N Row one
B Y Row two
C N Row three
D Y Row four
```
Print only the first two columns for the non-header rows (noting the `condition { action }` form of full `awk` statements, and that `awk` processes a text file row by row):

```
awk ' NR > 1 { print $1 , $2 } ' file.txt
```
```
A N
B Y
C N
D Y
```
Print column 3 for rows with a `Y` value in column 2:

```
awk ' $2 == "Y" { print $3 } ' file.txt
```
```
Row
Row
```
Only `Row` was printed above because (unlike `cut`) by default `awk` delimits columns on any whitespace (tabs or spaces), so `Row two` is split into separate fields; here `NF`, the number of fields (columns) in each row, shows this:

```
awk ' { print NF } ' file.txt
```
```
3
4
4
4
4
```
To use tab (`\t`) delimiters only, with `-F`:

```
awk -F"\t" ' { print NF } ' file.txt
```
```
3
3
3
3
3
```
Repeat the earlier query (print column 3 for rows with a `Y` value), now with the correct tab delimiters:

```
awk -F"\t" ' $2 == "Y" { print $3 } ' file.txt
```
```
Row two
Row four
```
(More involved!) To use `awk` variables, conditionals and multi-step commands, e.g. to print column 2 for odd rows but column 3 for even rows (where `%` is the modulus operator), additionally setting the output to be tab-delimited via the output field separator (`OFS`) option:

```
awk -F"\t" ' { even = NR % 2 == 0; c = even ? 3 : 2; l = even ? "Even" : "Odd" }
            { print l , $c } ' OFS="\t" file.txt
```
Repeating the above, noting that `awk` (and Luna) commands using `'` quotes can span multiple lines without `\` characters (and with different `{}` and `;` formatting here, too):

```
awk -F"\t" ' {
  even = NR % 2 == 0
  c = even ? 3 : 2
  l = even ? "Even" : "Odd"
  print l , $c
} ' OFS="\t" file.txt
```
To use a shell variable to specify, e.g., a file name:

```
f="file2.txt"
cat ${f}
```
```
Col1 Col2 Col3
A N Row one
B Y Row two
C N Row three
D Y Row four
```
Pass a shell variable into `awk`, noting the difference between `$j`, `j` and `${j}` below:

```
j=2
awk -F"\t" ' { print $j } ' j=${j} file.txt
```
```
Col2
N
Y
N
Y
```
Variables
Obviously there is much more to learn, and `awk` itself has a large number of options and more involved syntax, but if you can acquire enough familiarity to follow the logic and note the (sometimes subtle) syntactic differences between the commands above, you'll be in good shape. Don't be too put off by what might at first appear to be idiosyncratic convention and fussy syntax. The good news is that a little knowledge goes a long way (and is not at all dangerous...!).
For example, in the final example above, we first set a shell variable called `j` to the value `2`; when invoking `awk`, we reference the shell variable (by `${j}`, although `$j` would also work here) and assign it to a variable that `awk` will understand when processing `file.txt`. We happen to give it the same label (`j`), but it could be anything, e.g. `x=${j}`.
Then, when parsing the portion of text inside the single-quoted region (`'`), which is what `awk` takes as its instructions, `awk` will have access to that variable. Inside `awk`, the `$` sign means the column number, rather than pointing to a variable (as it does in `bash`). Thus `print $j` is actually interpreted as `print $2`, which means print the second column, which is what we have.
It can be handy to play around with the syntax. For example, this would give the same output as above:

```
awk -F"\t" ' { print $x } ' x=${j} file.txt
```

In contrast, this would print the value of `x` itself (i.e. `2`) once per row, rather than the second column:

```
awk -F"\t" ' { print x } ' x=${j} file.txt
```

And this would give an error in `awk` (as the `{}` brackets are interpreted differently by `bash` than by `awk`, and `${x}` is not valid `awk` syntax):

```
awk -F"\t" ' { print ${x} } ' x=${j} file.txt
```
We've labored this distinction about passing variables from the shell into a second program as this often occurs when using Luna on the command line, and it can be a source of confusion for beginners:

```
luna s.lst s=${s} -o out.db -s ' PSD sig=${s} dB spectrum max=25 '
```

Here, `s=${s}` passes the shell variable to Luna, i.e. if `s` was set to `C4` it would be identical to writing:

```
luna s.lst s=C4 -o out.db -s ' PSD sig=${s} dB spectrum max=25 '
```
Then within the Luna script (which uses single-quote delimiters like `awk`), the `${s}` references the Luna variable `s` (not the `bash` variable); also note that, unlike `awk`, Luna always requires the full `${x}` syntax for variables. The key point is that `${s}` refers to something conceptually different within the Luna script (which could be read from a file) compared to the shell variable, despite the fact that in this particular example they are set to the same value (and nominally have the same label too).
Assuming the shell variable `s` is set to `C4`, it is important to understand the differences below, which reflect the way that the shell and Luna work.
Works as above, but passes the shell variable `s` to the Luna variable `x`:

```
luna s.lst x=${s} -o out.db -s ' PSD sig=${x} dB spectrum max=25 '
```
Works as above, but uses the shell variable `s` directly in the command-line Luna script (note the double quotes `"`) rather than explicitly assigning it:

```
luna s.lst -o out.db -s " PSD sig=${s} dB spectrum max=25 "
```
Would not work as above: within single quotes, `${s}` is not expanded as a shell variable but rather is passed to Luna literally as `${s}` (rather than as the text `C4`); Luna will then look for a variable `s` but will complain with an error message, as it hasn't been defined:

```
luna s.lst -o out.db -s ' PSD sig=${s} dB spectrum max=25 '
```
Finally, the statement below would not work as above: although we are telling Luna to define a variable `x` (which will be set to the value `C4`), because the command line is in double quotes the shell will try to replace `${x}` as a shell variable before Luna gets to see the command. As `x` is not defined as a shell variable, and the shell expands undefined variables to a null string, what Luna will actually see is `PSD sig= dB spectrum max=25`:

```
luna s.lst -o out.db x=${s} -s " PSD sig=${x} dB spectrum max=25 "
```
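If you did want double quotes in that last form, one fix (a minimal sketch of standard shell escaping) is to escape the dollar sign so that the shell leaves `${x}` alone for Luna to expand:

```
# the backslash stops the shell expanding ${x}; Luna sees sig=${x} and substitutes x=C4
luna s.lst -o out.db x=${s} -s " PSD sig=\${x} dB spectrum max=25 "
```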
Key point
The key point is to be aware of which tool is interpreting the variable: the shell (`bash`), or a second tool such as `luna` or `awk`. Context and syntax can be used to control these things.
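The underlying shell rule can be demonstrated in two lines (a minimal sketch; `echo` simply prints its arguments): double quotes allow shell expansion, single quotes do not:

```
s=C4
echo "double quotes: ${s}"   # prints: double quotes: C4
echo 'single quotes: ${s}'   # prints: single quotes: ${s}
```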
More information on shell scripting
For more information on shell scripting, there is no shortage of online tutorials and references; they generally contain much, much more than is required to get started using Luna, so don't feel you need to read it all.
Finally, to remove this `tmp1` working folder and all the files therein:

```
cd ..
rm -rf tmp1
```
Hopefully, this shell orientation was helpful: now let's proceed to initial data QC.