Convenience functions to load/process Luna output (long-format)
(to be added to lunaR library)

Source:   http://zzz.bwh.harvard.edu/dist/luna/lload.R
 
Dependencies: the `data.table` library
 
Functions:

 lhead( )

    peek at variable names in a file (w/out loading the whole thing)
 

 lload() 

    load a file, and (optionally) extract certain variables,
    log-transform certain variables, transform the matrix to be “wide”
    format (i.e. one row per person ) and return a matching
    ‘meta-data’ data-frame that contains (as rows) information on the
    columns in the wide-format data-frame.  See example below.  You
    can also ‘assign’ a sleep stage ‘factor’ – often this is not
    present in the file, as all analysis was for, say, N2, but this
    makes it easier if you then want to merge different files with the
    otherwise same variable names.  (In truth, it could be any other
    file-level specifier, not just a sleep stage.)
 
 lcols()
    
    makes it easier to pick certain variable names (or column
    indices) programmatically, based on the variables and/or
    factors in the dataset.
 
 
Examples: 
 
  > source("http://zzz.bwh.harvard.edu/dist/luna/lload.R")
 
Peek at file (e.g. PSI per channel in this example)
 
  > lhead( "/data/purcell/scratch/pats/tmp/n2.psi1" ) 
  [1] "ID"      "CH"      "F"       "PSI"     "PSI_RAW" "STD"    
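
For illustration only: one way to peek at the header without loading the
whole file is to read just the first line.  This is a sketch of the idea,
not the lhead() source; peek_names() is a hypothetical helper, the file is
toy data, and a whitespace/tab-delimited header line is assumed.

```r
# Hypothetical helper (not part of lload.R): read only the first line
# of a whitespace/tab-delimited file and return its fields
peek_names <- function(file) {
  scan(file, what = character(), nlines = 1, quiet = TRUE)
}

# Toy long-format file standing in for real Luna output
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tCH\tF\tPSI",
             "id1\tC3\t3\t8.7",
             "id1\tC4\t3\t10.86"), tmp)

hdr <- peek_names(tmp)
print(hdr)
```

Because only one line is read, the cost does not depend on the number of
rows in the file.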
  
Both lhead() and lload() accept a separate prefix argument, which can
sometimes be more convenient, e.g. if you need to swap the root of the
path (say to “/Volumes/purcell/”) when working over VPN.
 
  > lhead( "n2.psi1" , prefix = "/data/purcell/scratch/pats/tmp" ) 
  [1] "ID"      "CH"      "F"       "PSI"     "PSI_RAW" "STD"    
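
Presumably the prefix is simply joined to the file name; in base R that is
what file.path() does.  The snippet below only illustrates that assumption
and is not taken from lload.R.

```r
# Assumed behaviour: prefix and file name are joined into one path
prefix <- "/data/purcell/scratch/pats/tmp"
full   <- file.path(prefix, "n2.psi1")
print(full)

# Changing only the prefix re-roots the same file name; "/Volumes/purcell"
# is a placeholder for whatever the alternative mount point is
alt <- file.path("/Volumes/purcell", "n2.psi1")
print(alt)
```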
 
The lload() function does the main work:
 
  > k <- lload( "/data/purcell/scratch/pats/tmp/n2.psi1" ) 
 
By itself, it hasn’t done anything that interesting, other than simply load the table:
 
  > str(k) 
  List of 2
   $ df     :'data.frame':    42390 obs. of  6 variables:
    ..$ ID     : chr [1:42390] "1-0160-093016" "1-0160-093016" "1-0160-093016" "1-0160-093016" ...
    ..$ CH     : chr [1:42390] "C3" "C4" "F3" "F4" ...
    ..$ F      : int [1:42390] 3 3 3 3 3 3 4 4 4 4 ...
    ..$ PSI    : num [1:42390] 8.7 10.86 -13.1 -14.3 6.24 ...
    ..$ PSI_RAW: num [1:42390] 0.114 0.148 -0.1946 -0.2461 0.0964 ...
    ..$ STD    : num [1:42390] 0.0131 0.0136 0.0149 0.0172 0.0155 ...
   $ df.meta: NULL
 
The function returns a list of two objects:  
 
   ‘df’  – the actual data-frame
   ‘df.meta’  -- a matching meta-data data-frame
 
Because we did not specify any factors in this case, df.meta is empty (NULL).
 
But we know (from looking at the header, and understanding the context
of the data) that F and CH are stratifying factors, whereas PSI, etc
are the actual variables.  To get a wide-format dataset we must specify
the factors explicitly:
 
  > k <- lload( "/data/purcell/scratch/pats/tmp/n2.psi1" , factors = c( "F" , "CH" ) ) 
 
Now the data-frame has 394 rows, which is the number of individuals,
and it is expanded out to 325 columns: one ID column, and then 324
variables, which are labelled according to data.table’s dcast()
syntax, i.e. with underscores:
 
  > dim( k$df ) 
  [1] 394 325
 
  > head( names( k$df )  ) 
  [1] "ID"       "PSI_3_C3" "PSI_3_C4" "PSI_3_F3" "PSI_3_F4" "PSI_3_O1"
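
The naming scheme can be reproduced on a toy table with data.table's
dcast().  This is a minimal sketch of the long-to-wide step under the
assumption that lload() does something similar; the data below are made up.

```r
library(data.table)

# Toy long-format table: 2 individuals x 2 frequencies x 2 channels
long <- data.table(
  ID  = rep(c("id1", "id2"), each = 4),
  F   = rep(c(3L, 3L, 4L, 4L), times = 2),
  CH  = rep(c("C3", "C4"), times = 4),
  PSI = c(8.7, 10.9, 6.2, 5.1, 7.3, 9.8, 4.4, 3.9),
  STD = c(0.013, 0.014, 0.015, 0.017, 0.012, 0.016, 0.011, 0.018)
)

# One row per ID; one column per (variable, F, CH) combination,
# with the parts of each name joined by underscores
wide <- dcast(long, ID ~ F + CH, value.var = c("PSI", "STD"))

print(dim(wide))
print(names(wide))
```

Here 2 variables x 2 frequencies x 2 channels give 8 columns plus ID, in
the same way that the 324 variables plus ID give 325 columns above.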
 
Wide-format frames can often be easier to work with, at least when the
stratifiers aren’t sparse (i.e. everybody has more or less the same
set of bands/channels, etc.).
 
We can add in 'external' stratifiers -- ones that apply to the entire file, and so are 
not represented by any column in that file.  For example, we can denote that this file has
metrics from stage N2, which is useful if we want to merge it with, say, REM
or N3 output downstream:
 
  > k <- lload( "/data/purcell/scratch/pats/tmp/n2.psi1", 
                factors = c("F", "CH"), fixed = list( SS = "N2" ) ) 
 
This basically just appends “N2” to each variable name, and records it in the meta-data in a column called SS:
 
  > head( names( k$df )  ) 
  [1] "ID"          "PSI_3_C3_N2" "PSI_3_C4_N2" "PSI_3_F3_N2" "PSI_3_F4_N2"
  [6] "PSI_3_O1_N2"
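
To see why the suffix helps when merging, here is a toy base-R sketch.  The
add_label() helper is hypothetical and only mimics the effect of a fixed
factor on the column names; the two data-frames are made up.

```r
# Toy wide-format frames for two stages, with identical variable names
n2  <- data.frame(ID = c("id1", "id2"), PSI_3_C3 = c(8.7, 7.3))
rem <- data.frame(ID = c("id1", "id2"), PSI_3_C3 = c(2.1, 1.9))

# Hypothetical helper: append a label to every non-ID column name
add_label <- function(df, label) {
  is_var <- names(df) != "ID"
  names(df)[is_var] <- paste(names(df)[is_var], label, sep = "_")
  df
}

# After labelling, the columns no longer clash when merged on ID
merged <- merge(add_label(n2, "N2"), add_label(rem, "REM"), by = "ID")
print(names(merged))
```

Without the labels, merge() would have to disambiguate the clashing
PSI_3_C3 columns itself (by default with .x and .y suffixes).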
 
To make it easier to work with these wide-format data-frames (which may
have 1000s to 10000s of variables), lload() also returns the meta-data
object.  For this last call:
 
  > head( k$df.meta ) 
    BASE F CH SS COL         VAR
  1  PSI 3 C3 N2   2 PSI_3_C3_N2
  2  PSI 3 C4 N2   3 PSI_3_C4_N2
  3  PSI 3 F3 N2   4 PSI_3_F3_N2
  4  PSI 3 F4 N2   5 PSI_3_F4_N2
  5  PSI 3 O1 N2   6 PSI_3_O1_N2
  6  PSI 3 O2 N2   7 PSI_3_O2_N2

That is, the rows of df.meta correspond to the columns of the main ‘df’.
 
The columns of df.meta are:

   BASE – the ‘core’ variable name (i.e. columns from the original,
   PSI, PSI_RAW and STD). 

   {factors} - the next set of columns are the factors
   specified via factors (F, CH) or fixed (SS) options.  

   COL – the corresponding column in ‘df’ (basically just row
   number+1, i.e. accounting for the ID column)

   VAR – the ‘full’ variable name in ‘df’ (i.e. made up of the
   constituent parts here: BASE+FACTORS)
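
As a toy illustration of how these columns relate to each other, a
meta-data table of the same shape can be built by hand.  This is a sketch
under the assumption that VAR is BASE plus the factor levels joined by
underscores, as in the listing above; it is not the lload() source.

```r
# Toy factor grid; CH varies fastest, then F, then BASE, mirroring
# the row order of the df.meta listing
grid <- expand.grid(CH = c("C3", "C4"), F = 3:4, BASE = c("PSI", "STD"),
                    stringsAsFactors = FALSE)

meta <- data.frame(BASE = grid$BASE, F = grid$F, CH = grid$CH, SS = "N2",
                   stringsAsFactors = FALSE)

# COL is the row number + 1, because column 1 of the wide table is ID
meta$COL <- seq_len(nrow(meta)) + 1L

# VAR is BASE followed by each factor level, joined by underscores
meta$VAR <- paste(meta$BASE, meta$F, meta$CH, meta$SS, sep = "_")

print(head(meta))
```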
 
In this way, you can pull out sets of columns from the main data table quite easily, e.g. all PSI variables for C3 and C4:
 
  > k$df.meta$VAR[ k$df.meta$BASE == "PSI" & k$df.meta$CH %in% c( "C3" , "C4" ) ]
   [1] "PSI_3_C3_N2"  "PSI_3_C4_N2"  "PSI_4_C3_N2"  "PSI_4_C4_N2"  "PSI_5_C3_N2" 
   [6] "PSI_5_C4_N2"  "PSI_6_C3_N2"  "PSI_6_C4_N2"  "PSI_7_C3_N2"  "PSI_7_C4_N2" 
  [11] "PSI_8_C3_N2"  "PSI_8_C4_N2"  "PSI_9_C3_N2"  "PSI_9_C4_N2"  "PSI_10_C3_N2"
  [16] "PSI_10_C4_N2" "PSI_11_C3_N2" "PSI_11_C4_N2" "PSI_12_C3_N2" "PSI_12_C4_N2"
  [21] "PSI_13_C3_N2" "PSI_13_C4_N2" "PSI_14_C3_N2" "PSI_14_C4_N2" "PSI_15_C3_N2"
  [26] "PSI_15_C4_N2" "PSI_16_C3_N2" "PSI_16_C4_N2" "PSI_17_C3_N2" "PSI_17_C4_N2"
  [31] "PSI_18_C3_N2" "PSI_18_C4_N2" "PSI_19_C3_N2" "PSI_19_C4_N2" "PSI_20_C3_N2"
  [36] "PSI_20_C4_N2"
 
And use those to select out columns from the main data table: e.g.
 
  > vars <- k$df.meta$VAR[ k$df.meta$BASE == "PSI" & k$df.meta$CH %in% c( "C3" , "C4" ) ]
  > head( k$df[ , vars ]  ) 
 
To make the syntax a little nicer, the final lcols() function is
basically just a wrapper around the above type of code:
 
  > lcols( k$df.meta , variable = "PSI" , factors = list( CH = c("C3" , "C4" ) ) ) 
   [1] "PSI_3_C3_N2"  "PSI_3_C4_N2"  "PSI_4_C3_N2"  "PSI_4_C4_N2"  "PSI_5_C3_N2" 
   [6] "PSI_5_C4_N2"  "PSI_6_C3_N2"  "PSI_6_C4_N2"  "PSI_7_C3_N2"  "PSI_7_C4_N2" 
  [11] "PSI_8_C3_N2"  "PSI_8_C4_N2"  "PSI_9_C3_N2"  "PSI_9_C4_N2"  "PSI_10_C3_N2"
  [16] "PSI_10_C4_N2" "PSI_11_C3_N2" "PSI_11_C4_N2" "PSI_12_C3_N2" "PSI_12_C4_N2"
  [21] "PSI_13_C3_N2" "PSI_13_C4_N2" "PSI_14_C3_N2" "PSI_14_C4_N2" "PSI_15_C3_N2"
  [26] "PSI_15_C4_N2" "PSI_16_C3_N2" "PSI_16_C4_N2" "PSI_17_C3_N2" "PSI_17_C4_N2"
  [31] "PSI_18_C3_N2" "PSI_18_C4_N2" "PSI_19_C3_N2" "PSI_19_C4_N2" "PSI_20_C3_N2"
  [36] "PSI_20_C4_N2"
 

Note that, unlike lload(), lcols() does not (and does not need to) make a distinction between
factors present in the file and those that are 'fixed' (i.e. the factors and fixed arguments of
lload()).  For lcols(), all of these are specified via the factors argument.

If there are multiple factors:

 - if a factor is not specified (i.e. “SS” and “F” in the above), then
   it is ignored in the matching (i.e. rows with any value of that
   factor are returned)
 
 - if a factor has multiple options (as a vector, e.g. CH = c( "C3" ,
   "C4" ) ) then the function returns matches to EITHER C3 /OR/ C4

 - if multiple factors are specified (e.g. F and CH), it returns rows that match F /AND/ CH
 
For example, all PSI values for channel of either C3 or C4, and that have F between 10 and 12 Hz:
 
  > lcols( k$df.meta , variable = "PSI" , factors = list( F = 10:12 , CH = c("C3" , "C4" ) ) ) 
  [1] "PSI_10_C3_N2" "PSI_10_C4_N2" "PSI_11_C3_N2" "PSI_11_C4_N2" "PSI_12_C3_N2"
  [6] "PSI_12_C4_N2"
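
The matching rules can be written out in a few lines of base R.
pick_cols() is a hypothetical stand-in for lcols(), given only to make the
OR-within / AND-across logic concrete; it is not the lload.R source, and
the meta-data here are toy values.

```r
# Hypothetical re-implementation of the matching rules
pick_cols <- function(meta, variable, factors = list()) {
  keep <- meta$BASE %in% variable            # any of the named variables
  for (f in names(factors)) {
    # OR within a factor (%in%), AND across factors (&);
    # factors that are not named never restrict the match
    keep <- keep & meta[[f]] %in% factors[[f]]
  }
  meta$VAR[keep]
}

# Toy meta-data with the same columns as df.meta
grid <- expand.grid(CH = c("C3", "C4", "F3"), F = 9:12,
                    BASE = c("PSI", "STD"), stringsAsFactors = FALSE)
meta <- data.frame(BASE = grid$BASE, F = grid$F, CH = grid$CH, SS = "N2",
                   stringsAsFactors = FALSE)
meta$VAR <- paste(meta$BASE, meta$F, meta$CH, meta$SS, sep = "_")

sel <- pick_cols(meta, "PSI", list(F = 10:12, CH = c("C3", "C4")))
print(sel)
```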
 
Note that the variable argument can be a vector too, e.g.  variable = c( "PSI" , "PSI_RAW" )