dmerge

dmerge is a tool for working with derived metrics from Luna (or other tools that generate individual-level output). By adhering to a particular file and folder naming convention, data from diverse pipelines can be easily compiled and checked across multiple individuals.

Usage

dmerge derived/domains
       studies/study1
       merged/study1.txt
       { domains/groups/exclusions }
       > merged/dictionary1.txt

Primary functions

compiles multiple, heterogeneous, stratified datasets across multiple individuals into a single file
enforces the use of domain-based data dictionaries
expands from a long- to wide-format dataset, providing a general convention for representing variable meta-data

Definitions

FACTOR: a variable such as channel, frequency or sleep stage, etc. The level of a factor is its value, e.g. C3, 12.5 (Hz) or N2 respectively. Factors stratify the values of typical variables, rather than represent data by themselves. Factors can appear in the body of a file (for long-format data) or are otherwise encoded in the variable name (see variable naming conventions, below). Factors can also be specified through file naming conventions (e.g. here the factor SS is set to N2: eeg_spectral_PSD-CH_B_SS-N2.txt).

DOMAIN: the highest grouping of data dictionaries; typically a domain will correspond to an NSRR investigator, i.e. one person/group controls the vocabulary within that domain.

GROUP: within a domain, data dictionaries can be modularized, with each separate data dictionary file representing a group that belongs to a particular domain.

TYPE: all variables are of a specified type, specified in the data dictionary; types are described below.

EXPANDED VARIABLES: wide-format variable names, i.e. whereby factor and level information for a variable is encoded as part of its variable name

Folder naming conventions

All dictionaries must be in a single folder (passed via the first parameter)
All individual data should be in individual subfolders within a study folder: e.g.

data/study1/id001/
data/study1/id002/
data/study1/id003/
...

There can be one or more intermediate subfolders between the study and individual folders; all folders are searched for data recursively, e.g.:

data/study1/luna/id001/
data/study1/luna/id002/
data/study1/luna/id003/
data/study1/pupa/id001/
data/study1/pupa/id002/
data/study1/pupa/id003/
...

Additional data that need to be stored (e.g. verbose intermediate files) can be put under a special folder extra within the individual sub-folders, and will be skipped; e.g. below, file3 is ignored

data/study1/id003/{file1}
data/study1/id003/{file2}
data/study1/id003/extra/{file3}
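
The folder rules above can be sketched in Python as follows. This is only an illustration, not dmerge's actual implementation: the function name `find_data_files` is hypothetical, and it assumes that the individual ID is the name of the folder directly containing a file and that any folder named `extra` is skipped.

```python
import os

def find_data_files(study_dir):
    """Yield (individual_id, path) for each data file under study_dir.

    Assumptions (not confirmed by dmerge itself): the individual ID is the
    name of the folder that directly contains the file, and every folder
    named 'extra' is skipped along with everything below it.
    """
    for root, dirs, files in os.walk(study_dir):
        # Prune 'extra' in place so os.walk never descends into it
        dirs[:] = sorted(d for d in dirs if d != "extra")
        for name in sorted(files):
            yield os.path.basename(root), os.path.join(root, name)
```

For the layout shown above, this would yield file1 and file2 for id003 but never file3.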

File naming conventions

All data files should be named according to the convention:

   {domain}_{group}_{label}{_factor}{_factor}{_factor-level}{.txt}

Examples (in the eeg_spectral domain/group):

  eeg_spectral_psd_B_CH_SS-N1
  eeg_spectral_psd_B_CH_SS-N2
  eeg_spectral_psd_F_CH_SS-N2
  eeg_spectral_psd_E_F_CH_SS-N2.txt

Domain, group and label are required (i.e. always three underscore-delimited terms starting the filename)
Domain and group must correspond to a data dictionary specified on the command line (e.g. here dict/eeg_spectral.txt must exist)
The file extension .txt, if present, is ignored; files ending '~' are ignored completely
Optional factors in the file name (e.g. CH for channel) indicate a column in the body of the file that gives levels for that factor
Optional {factor-level} pairs, e.g. SS-N2, set that factor (SS) to that level (N2), i.e. as if a column existed in the body of the file where all values of SS were set to N2. Only the first hyphen is used to delimit the factor and level, so F--2 will set factor F to -2.
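
As a rough sketch of how these file naming rules fit together, the following Python function splits a file name into its parts. The function name `parse_data_filename` and its return layout are hypothetical, chosen for illustration; dmerge's own parsing may differ in detail.

```python
def parse_data_filename(filename):
    """Split a data file name into domain, group, label and factors.

    Returns (domain, group, label, column_factors, fixed_factors), where
    column_factors name columns expected in the body of the file and
    fixed_factors maps factor -> level as set by the file name itself.
    """
    if filename.endswith(".txt"):
        filename = filename[: -len(".txt")]  # the .txt extension is ignored
    parts = filename.split("_")
    if len(parts) < 3:
        raise ValueError("expected at least domain_group_label: " + filename)
    domain, group, label = parts[:3]
    column_factors, fixed_factors = [], {}
    for term in parts[3:]:
        if "-" in term:
            # Only the first hyphen delimits factor and level: F--2 sets F to -2
            factor, level = term.split("-", 1)
            fixed_factors[factor] = level
        else:
            column_factors.append(term)
    return domain, group, label, column_factors, fixed_factors
```

For example, `parse_data_filename("eeg_spectral_psd_B_CH_SS-N2.txt")` separates the column factors B and CH from the fixed assignment of SS to N2.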

Variable naming conventions

Root variable names (e.g. PSD) can only contain alphanumeric characters and underscores
Factor names (e.g. B) can only contain alphanumeric characters and periods
In the data output and data-dictionary, root variables are expanded to contain any additional factor/level pairs

For example, if the following file

eeg_spectral_psd_B_CH_SS-N2.txt

contained the variable PSD (which was defined in the eeg_spectral domain/group data dictionary), then we might find expanded variables such as:

PSD.B_ALPHA_CH_C3_SS_N2
PSD.B_ALPHA_CH_C4_SS_N2
PSD.B_BETA_CH_C3_SS_N2
PSD.B_BETA_CH_C4_SS_N2
...

That is, expanded variable names are in the form:

   {root}.{fac_lvl}_{fac_lvl}

All variable names will be converted to UPPERCASE; variable names cannot start with a digit or underscore, and cannot contain a period
Factor names have the same rules as standard variable names, except they can contain periods but not underscores
Levels (i.e. that are "values" rather than "names") can contain both underscores and periods:
underscores will be converted to periods (C3_A2 --> C3.A2)
periods are common in level values, e.g. if the factor is frequency: F_12.25
For example, the following implies power for 12.25 Hz (F), for channel C3_A2 (CH), during N2 sleep (SS):
```
PSD.F_12.25_CH_C3.A2_SS_N2
```
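
The expansion from long to wide format can be illustrated with a short Python sketch. Both function names are hypothetical, and the sketch assumes that levels are uppercased along with the rest of the variable name and that each long-format row is available as a dictionary keyed by column name (with the individual identifier in a column named ID).

```python
def expand_variable(root, factor_levels):
    """Build a wide-format name {root}.{fac_lvl}_{fac_lvl} from (factor, level) pairs."""
    terms = []
    for factor, level in factor_levels:
        # Underscores in levels become periods, so '_' remains a clean delimiter
        terms.append(factor.upper() + "_" + str(level).replace("_", ".").upper())
    if not terms:
        return root.upper()
    return root.upper() + "." + "_".join(terms)

def long_to_wide(rows, factors, roots):
    """Collapse long-format rows (dicts) into one wide record per ID."""
    wide = {}
    for row in rows:
        pairs = [(f, row[f]) for f in factors]
        record = wide.setdefault(row["ID"], {})
        for root in roots:
            record[expand_variable(root, pairs)] = row[root]
    return wide
```

Here `expand_variable("psd", [("F", "12.25"), ("CH", "C3_A2"), ("SS", "N2")])` gives the name shown above, `PSD.F_12.25_CH_C3.A2_SS_N2`.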

All factors should be described in the data dictionary; dmerge will auto-generate data-dictionary entries for expanded variables based on the root variable label, and factors/levels (see examples).

Data-dictionary format

As well as residing in the folder specified on the command line, data dictionaries should adhere to the following naming convention:

{domain}_{group}.txt

e.g.

dict/eeg_spectral.txt
dict/macro_stages.txt

Possible domains (e.g. broad areas) are:

macro
eeg
ecg
resp
actigraphy

Domain and group names cannot contain underscore characters, as these are used to delimit domain, group and file tag/factor-lists. They can contain periods and hyphens.

To only extract/compile variables from specific domains, add those domain or domain_group identifiers to the command line, after the other options. For example, if the folders dict/ and study1 exist:

All domains/groups:

dmerge dict study1 study1.txt

Only eeg domain variables:

dmerge dict study1 study1.txt eeg

Only eeg_spectral and eeg_spindles groups, and all macro domain variables:

dmerge dict study1 study1.txt eeg_spectral eeg_spindles macro
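
The selection logic implied by these three examples can be summarized in a small Python sketch. The function name `is_selected` is hypothetical and this is not taken from dmerge's source; it only restates the behavior described above.

```python
def is_selected(domain, group, selections):
    """Return True if a domain/group dictionary matches the command-line selections."""
    if not selections:
        return True  # nothing listed: all domains/groups are included
    # A bare domain selects all of its groups; domain_group selects one group
    return domain in selections or domain + "_" + group in selections
```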

Within a dictionary file, we expect 3 tab-delimited columns:

variable
type
label/description

Any lines starting with % are treated as comments and ignored.

Types are as follows:

Type	Description	Notes
`numeric`	Floating point values	Tested as a valid number; can use scientific notation (e or E)
`integer`	Integer values	Tested as an integer rather than float, i.e. digits and minus sign only, no decimal point/period character or scientific notation
`text`	Any text	No validation rules
`yesno`	Boolean value	Case-insensitive encodings: true = 1, y[es] and t[rue]; false = 0, n[o] and f[alse]
`factor`	A factor	Indicates that this is a factor rather than a variable
`date`	Date	Simple check of XX/YY or XX/YY/ZZ encoding (i.e. 2 or 3 elements, delimited by either / or - characters)
`time`	Times	Simple check of HH:MM or HH:MM:SS encoding (i.e. 2 or 3 elements, delimited by either : or . characters)
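
To make the type checks in the table concrete, here is a minimal Python sketch of comparable validators. The regular expressions and the function name `is_valid` are assumptions made for illustration: for instance, whether dmerge accepts a leading plus sign for numeric values is not stated above, and the date/time checks here only count non-empty elements, as the table describes.

```python
import re

# Assumed patterns, written to follow the descriptions in the table above
NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
INTEGER = re.compile(r"-?\d+")
YESNO = {"1", "y", "yes", "t", "true", "0", "n", "no", "f", "false"}

def n_elements(value, delimiters):
    """Count elements after splitting on any delimiter; 0 if any element is empty."""
    parts = re.split("[" + re.escape(delimiters) + "]", value)
    return len(parts) if all(parts) else 0

def is_valid(value, vtype):
    """Return True if the string value passes a basic check for the given type."""
    if vtype == "numeric":
        return NUMERIC.fullmatch(value) is not None
    if vtype == "integer":
        return INTEGER.fullmatch(value) is not None
    if vtype == "yesno":
        return value.lower() in YESNO
    if vtype == "date":
        return n_elements(value, "/-") in (2, 3)
    if vtype == "time":
        return n_elements(value, ":.") in (2, 3)
    return vtype == "text"  # text has no validation rules
```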

Aliases

Data dictionaries can specify aliases for variable names. This can be useful, say, if different files have used different labels for the same variable. Alternatively, aliases can be used to resolve name clashes between domains/data-files, if two different variables have the same name. Use the keyword alias in the data dictionary as follows:

CH      factor   Channel
CNAME   alias    CH

This sets CNAME to be an alias for CH, where CH is said to be the canonical form.

Any time CNAME is encountered in any data-file of that domain, it will be as if CH were written there instead, i.e. so CH will appear in the output.
The tool will check that a canonical form has first been specified, i.e. appears earlier in the data dictionary.
Also, it will check that the same label does not appear as both an alias and a canonical form, and that the same alias is not listed multiple times for different canonical forms.
As with all variable names, matches are case-insensitive, but all output will be standardized to uppercase.
You can have multiple aliases for the same canonical form, however: e.g. here both CNAME and my.ch would be remapped to CH:
```
CH      factor   Channel
CNAME   alias    CH
my.ch   alias    CH
```
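
The alias checks described above could look roughly like the following Python sketch. The function name `load_aliases` is hypothetical, and the sketch assumes tab-delimited dictionary lines (as specified for dictionary files) with the canonical form given in the third column of an alias row.

```python
def load_aliases(dictionary_lines):
    """Map each alias (uppercased) to its canonical name, applying the checks above."""
    canonical, aliases = set(), {}
    for line in dictionary_lines:
        if not line.strip() or line.startswith("%"):
            continue  # skip blank lines and % comments
        name, vtype, label = line.rstrip("\n").split("\t", 2)
        name = name.strip().upper()
        if vtype.strip().lower() != "alias":
            if name in aliases:
                raise ValueError(name + " is both an alias and a canonical form")
            canonical.add(name)
            continue
        target = label.strip().upper()
        if target not in canonical:
            raise ValueError(target + " must be defined before its alias " + name)
        if name in canonical:
            raise ValueError(name + " is both an alias and a canonical form")
        if aliases.get(name, target) != target:
            raise ValueError(name + " is an alias for more than one canonical form")
        aliases[name] = target
    return aliases
```

With the three dictionary lines above, both CNAME and MY.CH would map to CH.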

Data dictionary output format

The combined data are written to the file specified on the command line. The corresponding data-dictionary is written to stdout, using the following format:

Variable	Description
VAR	Variable name (e.g. DENS_F_11_CH_F3_SS_N2)
BASE	Variable root (e.g. `DENS`)
TYPE	Type of variable
OBS	Number of non-missing observations
GROUP	Group name
Factors...	Additional columns describe all factors encountered, e.g. band, channel and stage (B, CH and SS respectively), with the row as the corresponding level for that expanded variable
DOMAIN	Domain name
DESC	Description
COL	Column number in the data file (0 for factors, i.e. which do not appear as columns/variables in the data file)

Exclusions

Can exclude particular domains/groups by explicitly listing domains/groups to be included on the command line, after the other options (see example above)
Can exclude particular variables by setting name to -VAR in the data dictionary
Can exclude particular files by setting a -tag or '-tag-factorlist' on the command line (-HYPNO-C will skip macro_stages_HYPNO_C.txt)

Permissible character cheat sheet

These rules are less complicated than they may look at first sight; here is a summary of identifier conventions:

Identifier	Allowed	Disallowed	Notes
All identifiers below	Alphanumeric, period, hyphen/minus & underscore	All other special characters (including spaces)	These conventions allow identifiers to be straightforwardly represented as both filenames (across different OS) and variable names (e.g. within a package such as R)
Data dictionary file	Alphanumeric, period, hyphen and underscore		An optional `.txt` extension is allowed, and will be ignored; underscores are used to delimit domain and group identifiers from file names.
Data dictionary domain/group	Alphanumeric, period and hyphen	Underscore	As above, underscores are used to delimit domain-group identifiers
Variable	Alphanumeric and underscore	Period, hyphen	Should start with a letter; upper and lowercase allowed, although all output will be transformed to uppercase
Factor	Alphanumeric and period	Underscore, hyphen	Should start with a letter; upper and lowercase allowed, although all output will be transformed to uppercase
Factor levels (data file body)	Alphanumeric, period, hyphen/minus & underscore		Underscores are allowed but are transformed to a period in the output (in variable names)
Factor levels (data file name)	Alphanumeric, period, hyphen/minus	Underscores	Levels specified in this way can contain periods and additional hyphens; they cannot contain underscores
Data file	Alphanumeric, period, hyphen/minus & underscore		Special format(s): `domain_group_tag` for data without factors ('baseline'); `domain_group_tag_F1_F2_F3` for data with 3 factors F1, F2 & F3; `domain_group_tag_F1_F2-X_F3-Y` as above, but setting F2 to X and F3 to Y. i) Underscores delimit domain, group, file-tag and optional factor list; ii) optionally, the first hyphen in `FAC-LVL` sets factor `FAC` to level `LVL`. A `.txt` file extension is allowed, and will be removed

Misc details

All data files must be ASCII plain-text, tab-delimited with a single header row
Expects the individual/EDF ID column to be named ID
Checks that variable names (except factors) are unique within and across domains
Run with option -v to produce verbose output
Run with option -s to enforce strict mode: here all variables/factors/domains/groups present in the data folders must have an exact match in the data dictionary (otherwise an error is given and the program halted). Otherwise, in the default (non-strict) mode, such things are just skipped over silently (or, if -v is set, with a message to standard error stream).
Flag duplicate rows (same ID/factors) in and across datasets
Creates a data dictionary, with variable base-names (e.g. PSD) and factors (e.g. spectral band) expanded (e.g. PSD.SIGMA_C3_N2)
List the column numbers and count of non-missing observations for all individuals
Each domain can have its own missing data symbol (which are harmonized to a single value in the output; use a special MISSING variable name and TYPE in the data dictionary, where the description column gives the actual missing value code)
Merges based on case-insensitive matching; sets all variable names to UPPERCASE in output
Spaces and special characters in variable names are removed
Basic type enforcement checks on values
Counts missing data for each variable
Data-dictionary only needs to specify root (long-format) variables (e.g. PSD not PSD.B_SIGMA_CH_C3_SS_N2)
Throws an error if a variable is present but not described in the dictionary (in strict mode; otherwise the variable is skipped, as noted above)
Columns can be in any order across datasets
Even in strict mode, where all variables must be in the data dictionary, different individuals need not all have the same variables in each file (missing data will be indicated in the output)
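
As one illustration of the checks listed above, the duplicate-row test (same ID and factor levels) can be sketched in Python as follows. The function name `find_duplicate_rows` is hypothetical, and rows are assumed to be dictionaries keyed by column name; how dmerge itself reports duplicates is not described here.

```python
def find_duplicate_rows(rows, factors):
    """Return the (ID, level, ...) keys that occur in more than one row."""
    seen, duplicates = set(), []
    for row in rows:
        # A row is identified by its ID plus the level of every factor
        key = (row["ID"],) + tuple(row[f] for f in factors)
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```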