Convenience functions to load/process Luna output (long-format)
(to be added to the lunaR library)

Source: http://zzz.bwh.harvard.edu/dist/luna/lload.R

Dependencies: the `data.table` library

Functions:

lhead()
    peek at variable names in a file (without loading the whole thing)

lload()
    load a file and (optionally) extract certain variables, log-transform certain variables, transform the table to 'wide' format (i.e. one row per person), and return a matching 'meta-data' data-frame that contains (as rows) information on the columns in the wide-format data-frame. See the examples below. You can also 'assign' a sleep stage 'factor': often this is not present in the file, as all analysis was for, say, N2, but this makes it easier if you then want to merge different files with otherwise the same variable names. (In truth, it could be any other file-level specifier, not just a sleep stage.)

lcols()
    makes it easier to programmatically pick certain variable names (or column indices) based on the variables and/or factors in the dataset.

Examples:

> source("http://zzz.bwh.harvard.edu/dist/luna/lload.R")

Peek at a file (PSI per channel in this example):

> lhead( "/data/purcell/scratch/pats/tmp/n2.psi1" )
[1] "ID"      "CH"      "F"       "PSI"     "PSI_RAW" "STD"

Both lhead() and lload() allow a separate prefix, which can sometimes be more convenient, e.g. if you want to swap to "/Volumes/purcell/" when working over VPN:

> lhead( "n2.psi1" , prefix = "/data/purcell/scratch/pats/tmp" )
[1] "ID"      "CH"      "F"       "PSI"     "PSI_RAW" "STD"

The lload() function does the main work:

> k <- lload( "/data/purcell/scratch/pats/tmp/n2.psi1" )

By itself, it hasn't done anything that interesting, other than simply load the table:

> str(k)
List of 2
 $ df     :'data.frame': 42390 obs. of 6 variables:
  ..$ ID     : chr [1:42390] "1-0160-093016" "1-0160-093016" "1-0160-093016" "1-0160-093016" ...
  ..$ CH     : chr [1:42390] "C3" "C4" "F3" "F4" ...
  ..$ F      : int [1:42390] 3 3 3 3 3 3 4 4 4 4 ...
  ..$ PSI    : num [1:42390] 8.7 10.86 -13.1 -14.3 6.24 ...
  ..$ PSI_RAW: num [1:42390] 0.114 0.148 -0.1946 -0.2461 0.0964 ...
  ..$ STD    : num [1:42390] 0.0131 0.0136 0.0149 0.0172 0.0155 ...
 $ df.meta: NULL

The function returns a list of two objects:

    df        the actual data-frame
    df.meta   a matching meta-data data-frame

Because we did not specify any factors in this case, df.meta is empty (NULL). But we know (from looking at the header, and understanding the context of the data) that F and CH are stratifying factors, whereas PSI, PSI_RAW and STD are the actual variables. To get a wide-format dataset we must specify the factors explicitly:

> k <- lload( "/data/purcell/scratch/pats/tmp/n2.psi1" , factors = c( "F" , "CH" ) )

Now the data-frame has 394 rows, which is the number of individuals, and it is expanded out to 325 columns: one ID column, and then 324 variables, which are labelled according to data.table's dcast() syntax, i.e. with underscores:

> dim( k$df )
[1] 394 325

> head( names( k$df ) )
[1] "ID"       "PSI_3_C3" "PSI_3_C4" "PSI_3_F3" "PSI_3_F4" "PSI_3_O1"

Wide-format frames can often be easier to work with, at least when the stratifiers aren't sparse (i.e. everybody has more or less the same set of bands, channels, etc.).

We can add in 'external' stratifiers: ones that apply to the entire file, and so are not represented by any column in that file. For example, to denote that this file has metrics from stage N2.
This can be useful if we want to merge this with, say, REM or N3 output downstream:

> k <- lload( "/data/purcell/scratch/pats/tmp/n2.psi1", factors = c("F", "CH"), fixed = list( SS = "N2" ) )

This basically just appends "N2" to each variable name; in the meta-data it is recorded in a column called SS:

> head( names( k$df ) )
[1] "ID"          "PSI_3_C3_N2" "PSI_3_C4_N2" "PSI_3_F3_N2" "PSI_3_F4_N2"
[6] "PSI_3_O1_N2"

To make it easier to work with these files (which may then have 1000s to 10000s of variables), the meta-data object is also returned. In this last instance:

> head( k$df.meta )
  BASE  F CH SS COL         VAR
1  PSI  3 C3 N2   2 PSI_3_C3_N2
2  PSI  3 C4 N2   3 PSI_3_C4_N2
3  PSI  3 F3 N2   4 PSI_3_F3_N2
4  PSI  3 F4 N2   5 PSI_3_F4_N2
5  PSI  3 O1 N2   6 PSI_3_O1_N2
6  PSI  3 O2 N2   7 PSI_3_O2_N2

That is, the rows of df.meta correspond to the columns of the main 'df', i.e.

    BASE        the 'core' variable name (i.e. a column from the original file: PSI, PSI_RAW or STD)
    {factors}   the next set of columns are the factors specified via the factors (F, CH) or fixed (SS) options
    COL         the corresponding column in 'df' (basically just row number + 1, i.e. accounting for the ID column)
    VAR         the 'full' variable name in 'df' (i.e. made up of the constituent parts: BASE + factors)

In this way, you can pull out sets of columns from the main data table quite easily, e.g.
all PSI for C3 or C4:

> k$df.meta$VAR[ k$df.meta$BASE == "PSI" & k$df.meta$CH %in% c( "C3" , "C4" ) ]
 [1] "PSI_3_C3_N2"  "PSI_3_C4_N2"  "PSI_4_C3_N2"  "PSI_4_C4_N2"  "PSI_5_C3_N2"
 [6] "PSI_5_C4_N2"  "PSI_6_C3_N2"  "PSI_6_C4_N2"  "PSI_7_C3_N2"  "PSI_7_C4_N2"
[11] "PSI_8_C3_N2"  "PSI_8_C4_N2"  "PSI_9_C3_N2"  "PSI_9_C4_N2"  "PSI_10_C3_N2"
[16] "PSI_10_C4_N2" "PSI_11_C3_N2" "PSI_11_C4_N2" "PSI_12_C3_N2" "PSI_12_C4_N2"
[21] "PSI_13_C3_N2" "PSI_13_C4_N2" "PSI_14_C3_N2" "PSI_14_C4_N2" "PSI_15_C3_N2"
[26] "PSI_15_C4_N2" "PSI_16_C3_N2" "PSI_16_C4_N2" "PSI_17_C3_N2" "PSI_17_C4_N2"
[31] "PSI_18_C3_N2" "PSI_18_C4_N2" "PSI_19_C3_N2" "PSI_19_C4_N2" "PSI_20_C3_N2"
[36] "PSI_20_C4_N2"

And use those to select columns from the main data table, e.g.

> vars <- k$df.meta$VAR[ k$df.meta$BASE == "PSI" & k$df.meta$CH %in% c( "C3" , "C4" ) ]
> head( k$df[ , vars ] )

To make the syntax a little nicer, the final lcols() function is basically just a wrapper around the above type of code:

> lcols( k$df.meta , variable = "PSI" , factors = list( CH = c("C3" , "C4" ) ) )
 [1] "PSI_3_C3_N2"  "PSI_3_C4_N2"  "PSI_4_C3_N2"  "PSI_4_C4_N2"  "PSI_5_C3_N2"
 [6] "PSI_5_C4_N2"  "PSI_6_C3_N2"  "PSI_6_C4_N2"  "PSI_7_C3_N2"  "PSI_7_C4_N2"
[11] "PSI_8_C3_N2"  "PSI_8_C4_N2"  "PSI_9_C3_N2"  "PSI_9_C4_N2"  "PSI_10_C3_N2"
[16] "PSI_10_C4_N2" "PSI_11_C3_N2" "PSI_11_C4_N2" "PSI_12_C3_N2" "PSI_12_C4_N2"
[21] "PSI_13_C3_N2" "PSI_13_C4_N2" "PSI_14_C3_N2" "PSI_14_C4_N2" "PSI_15_C3_N2"
[26] "PSI_15_C4_N2" "PSI_16_C3_N2" "PSI_16_C4_N2" "PSI_17_C3_N2" "PSI_17_C4_N2"
[31] "PSI_18_C3_N2" "PSI_18_C4_N2" "PSI_19_C3_N2" "PSI_19_C4_N2" "PSI_20_C3_N2"
[36] "PSI_20_C4_N2"

Note that, unlike lload(), lcols() does not (and does not need to) make a distinction between factors present in the file and those that are 'fixed' (i.e. the factors and fixed arguments of lload()). For lcols(), all of these are specified via the factors argument.

If there are multiple factors:

- if a factor is not specified (i.e. "SS" and "F" in the above), then this is ignored in the matching (i.e.
  all rows returned)
- if a factor has multiple options (as a vector, e.g. CH = c( "C3" , "C4" ) ) then the function returns matches to EITHER C3 /OR/ C4
- if multiple factors are given (e.g. F and CH), it returns rows that match F /AND/ CH

For example, all PSI values for a channel of either C3 or C4, and that have F between 10 and 12 Hz:

> lcols( k$df.meta , variable = "PSI" , factors = list( F = 10:12 , CH = c("C3" , "C4" ) ) )
[1] "PSI_10_C3_N2" "PSI_10_C4_N2" "PSI_11_C3_N2" "PSI_11_C4_N2" "PSI_12_C3_N2"
[6] "PSI_12_C4_N2"

Note that 'variables' can be a vector too, e.g. variables = c("PSI" , "PSI_RAW")
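
The matching rules above (OR within a factor, AND across factors, unspecified factors ignored) can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the actual lcols() source: the helper name pick_cols() and the toy meta table are hypothetical, built to mimic the layout of df.meta.

```r
# Toy meta-data table in the same layout as df.meta
# (values invented for illustration; not read from any Luna output)
meta <- expand.grid( CH = c("C3", "C4", "F3"), F = 10:13,
                     BASE = c("PSI", "PSI_RAW"), stringsAsFactors = FALSE )
meta$SS  <- "N2"
meta$VAR <- paste( meta$BASE, meta$F, meta$CH, meta$SS, sep = "_" )

# Hypothetical helper (NOT the real lcols()):
#  - values within one factor are combined with OR (%in%)
#  - different factors are combined with AND (&)
#  - factors that are not named are simply never tested, i.e. ignored
pick_cols <- function( meta, variables, factors = list() ) {
  keep <- meta$BASE %in% variables
  for ( f in names(factors) ) keep <- keep & meta[[f]] %in% factors[[f]]
  meta$VAR[ keep ]
}

sel <- pick_cols( meta, variables = "PSI",
                  factors = list( F = 10:12, CH = c("C3", "C4") ) )
print( sel )
```

Here SS is never mentioned in the factors list, so it plays no part in the selection, and F3 and F = 13 are excluded because they fail the CH and F conditions respectively.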