# dmerge example

Below is a toy example of using `dmerge`, specifically designed to illustrate some features and the design logic. That is, for didactic reasons, this example is written to not work "out of the box", but rather to flag the types of things that might be encountered on real data.

## Data dictionaries

In this example we assume a single domain (just called `d1`) and two groups (here `g1` and `g2`). The data dictionaries are under a top-level folder `dict/`:

``````ls dict/
``````
``````d1_g1.txt       d1_g2.txt
``````

The first group `g1` has two variables `V1` and `V2` along with two stratifying factors (i.e. effectively defining the repeated measures of those variables for each individual):

``````cat dict/d1_g1.txt
``````
``````F1      factor  Factor 1
F2      factor  Factor 2
V1      num     Var 1
V2      num     Var 2
``````

The second group, `g2`, also has variables named `V1` and `V9`, along with factors `F1`, `F2` and also `SS`

``````cat dict/d1_g2.txt
``````
``````SS      factor  Sleep stage
F1      factor  Factor 1
F2      factor  Different label
V1      num     Var 1
V9      int     More variables
``````

Hint

Factors can be common across domains and groups, as these will contain common elements (channel, sleep stage, etc).

As we'll see below, variables (i.e. corresponding to the actual information) are defined as being specific to a domain/group, and it would be a problem to have the same label e.g. `N` for count appear in different contexts (i.e. different domains/groups) if it means different things (e.g. number of OSA events, number of arousals, number of spindles, etc). Thus, as we'll see, `dmerge` will spot this potential problem.

## The data

In this example we have two individuals, `id1` and `id2`. We require that each individual has one (or more) subfolders under a root project folder.

``````\$ ls studies/
id1     id2
``````

The first individual has two data files:

``````ls studies/id1
``````
``````d1_g1_s1_F1_F2.txt      d1_g2_s1_SS-N2.txt
``````

The second individual only has one:

``````ls studies/id2/
``````
``````d1_g1_s1_F1_F2.txt
``````

The filename convention tells us:

• which domain/group each dataset belongs, i.e. `d1_g1` or `d1_g2`

• the tag is just set to `s1` in these examples, as there is only a single file per domain/group in this trivial example

• additional underscore-delimited terms reflect the factors that apply for that dataset, e.g. `F1`, `F2`

• in one case, a level value `N2` is also ascribed to a factor (`SS`), implying that all rows should be assigned to this stratum, and that there is no column `SS` in the datafile itself; i.e. in real data, other files might imply different strata

Looking at the actual data:

``````\$ cat studies/id1/d1_g1_s1_F1_F2.txt
``````
``````ID      F1      F2      V1      V2
id1     A       X       1       2
id1     A       Y       3       4
id1     B       Y       5       U
``````

Note here that `V2` has a missing value set as a nonstandard value `U`. Certain values (`.`, `?`, `NA`, `NaN`, etc) are all treated as missing, but (given that `V2` has a numeric type defined, this value will cause a problem downstream, as we'll see.

``````\$ cat studies/id1/d1_g2_s1_SS-N2.txt
``````
``````ID      V1      V9      V10
id1     10      .       11
id1     NA      12      13
id1     14      15      ?
``````

In the second data file for `id1` (above), we have only standard missing value codes. Note a variable `V10` that was not defined in the corresponding data dictionary. Also note that `V1` is present here, but has different values from the first file. Also note that in this case, there are no stratifying variables included, but multiple (different) values of the same variable for the same individual, which is also clearly a problem (i.e. somebody forgot to include the relevant stratifiers in this output to distinguish rows 1, 2 and 3.

Finally, here we data for the second individual.

``````\$ cat studies/id2/d1_g1_s1_F1_F2.txt
``````
``````ID      F1      F2      V1      V2
id2     A       X       7       8
id2     B       Y       9       10
``````

## Running dmerge

Here the various runs needed to get a final results are done to illustrate a couple of features of `dmerge`.

### First run (namespaces of factors and variables)

This is the initial run of `dmerge` - which just points to the data dictionary folder (`dict/`) and the study data folder (`studies/`) and gives a file to be created, for the output to be collated in, `s1.txt`:

``````dmerge dict studies s1.txt
``````
`````` ++ adding domain d1::g1 (4 variables)
++ adding domain d1::g2 (5 variables)

*** error : inconsistent label for factor F2 across data dictionaries
``````

We get an error as `dmerge` notices that the `F2` factor is defined different (different descriptions) across dictionaries. In trying to harmonize data-files, this is obviously a problem: which should be used? This error therefore alerts the user to this issue, which has to be fixed. Looking up the relevant lines (which might not be trivial in a large project) here with `grep`:

``````grep F2 dict/*
``````
``````dict/d1_g1.txt:F2       factor  Factor 2
dict/d1_g2.txt:F2       factor  Different label
``````

So, either in one or the other of these files, you must make the description label identical, to enforce consistency across datafiles. This avoids, for example, `S` meaning signal in one set of outputs, but sleep stage in a second, which would lead to downstream problems.

### Second run (introducing aliases)

Trying again, we get a new error:

``````dmerge dict studies s1.txt
``````
`````` ++ adding domain d1::g1 (4 variables)
++ adding domain d1::g2 (5 variables)

*** error : V1 is duplicated across data dictionaries
``````

Looking up this variable across dictionaries:

``````grep V1 dict/*
``````
``````dict/d1_g1.txt:V1       num     Var 1
dict/d1_g2.txt:V1       num     Var 1
``````

As noted above, although the labels are identical, this is purposefully not allowed by the tool: variables are by definition specific to a domain and the names should be unique to a domain.

Hint

The same variable name can exist in diffferent data files within the same domain/group. e.g. `PSD` might exist in two files

`eeg_spec_avg_B.txt`

`eeg_spec_avg_B_SS.txt`

meaning that this measure is stratified by either band (`B`) or by both band and sleep stage (`SS`). The variable `PSD` would only feature once in the data dictionary `eeg_spec.txt` (along with factor definitions for `B` and `SS`), and the the program would correctly pull these together.

It would be burdensome to have to go back to the original files (which may have been generated by different tools/people, and may large and not easy to edit, etc) and so `dmerge` allows for aliases to be defined on the data dictionaries. For one of these domain/groups, we can effectively relabel the variable `V1` to something else when harmonizing. Here we edit `dict/d1_g2.txt`, to change the line:

``````V1     num   Var 1
``````
to become two lines
``````V1b   num     A new var 1
V1    alias   V1b
``````

That is, we first define a new variable `V1b`, which is specific to this domain, and then define an alias for `V1b` which is `V1`. This means that for any `g1_d2` datafile, any instance of `V1` is treated as if it were written `V1b`, and this avoids any potential naming conflicts.

### Third run (missing data codes)

Having fixed the above, we re-run:

``````dmerge dict studies s1.txt > s1.dict
``````
`````` ++ adding domain d1::g1 (4 variables)
++ adding domain d1::g2 (5 variables)
++ read 3 rows from data-file studies/id2/d1_g1_s1_F1_F2.txt
domain    [ d1 ]
group     [ g1 ]
file-tag  [ s1 ]
variables [ V1 | V2 ]
factors   [ F1 | F2 ]

*** error : invalid value [U] for V2 (type Numeric)
in: studies/id1/d1_g1_s1_F1_F2.txt
``````

As noted above, this illustrates the simple type-checking features, spotting that `U` is not a valid numeric value. If we know it is a missing code used in the data file, we can add a line to the dictionary `dict/d1_g1.txt`: (here just two tab-delimited cols):

``````  missing       U
``````

### Fourth run (multiple conflicting values / missing strata)

Running again, we now see a new error:

``````dmerge dict studies s1.txt
``````
`````` ++ adding domain d1::g1 (4 variables)
++ adding domain d1::g2 (5 variables)
++ read 3 rows from data-file studies/id2/d1_g1_s1_F1_F2.txt
domain    [ d1 ]
group     [ g1 ]
file-tag  [ s1 ]
variables [ V1 | V2 ]
factors   [ F1 | F2 ]
++ read 4 rows from data-file studies/id1/d1_g1_s1_F1_F2.txt
domain    [ d1 ]
group     [ g1 ]
file-tag  [ s1 ]
variables [ V1 | V2 ]
factors   [ F1 | F2 ]

*** error : multiple values for id1 V9.SS_N2
``````

This flags the issue we spotted above: in the data file. (The tool will spot if there are duplicate discordant values spread across multiple files also.) In this instance, it is clear this is due to a stratifying factor not being included in the file:

``````cat studies/id1/d1_g2_s1_SS-N2.txt
``````
``````ID      V1      V9      V10
id1     10      .       11
id1     NA      12      13
id1     14      15      ?
``````

If we were to go back and correct the original data, which would be necessary here, say we instead have this:

``````cat studies/id1/d1_g2_s1_SS-N2.txt
``````
``````ID      V1      V9      V10     F2
id1     10      .       11      X
id1     NA      12      13      Y
id1     14      15      ?       Z
``````

i.e. we've made each row unique by adding the missing factor, `F2`, and so there should be no conflicts now. However, if we were to re-run as is, we'd still get the same error. Why? This is because the filename convention has not specified that `F2` is a factor for this file.

Note

Yes, that `F2` is a factor could be inferred from `dict/d1_g2.txt`, but the tool purposefully has the model that the dictionaries must contain a complete representation of the truth of the data, but also requires a second level of consistency (here that filenames match). The design logic is that, at the cost of a marginally more involved set-up, it makes it more robust downstream, and less likley to have subtle errors when merging across different datafiles.

We therefore would need to also change the name of the datafile as well as the contents to reflect the status of `F2`:

``````mv studies/id1/d1_g2_s1_SS-N2.txt studies/id1/d1_g2_s1_F2_SS-N2.txt
``````

### Fifth run: validation

The fifth run will now work. Again, that it "failed" the first four times is not reflecting problems with the tool -- rather, think of it as giving feedback to enfore a set of conventions that help for data harmonization.

``````dmerge dict studies s1.txt > s1.dict
``````
`````` ++ adding domain d1::g1 (4 variables)
++ adding domain d1::g2 (5 variables)
++ read 3 rows from data-file studies/id2/d1_g1_s1_F1_F2.txt
domain    [ d1 ]
group     [ g1 ]
file-tag  [ s1 ]
variables [ V1 | V2 ]
factors   [ F1 | F2 ]
++ read 4 rows from data-file studies/id1/d1_g1_s1_F1_F2.txt
domain    [ d1 ]
group     [ g1 ]
file-tag  [ s1 ]
variables [ V1 | V2 ]
factors   [ F1 | F2 ]
++ read 4 rows from data-file studies/id1/d1_g2_s1_F2_SS-N2.txt
domain    [ d1 ]
group     [ g2 ]
file-tag  [ s1 ]
variables [ V1B | V9 | V10 (skipped) ]
factors   [ F2 | SS = N2 ]

finished: processed 2 individuals across 3 files, yielding 12 (expanded) variables
``````

Here we've also saved the data dictionary to a file `s1.dict` as well as the actual data, in `s1.txt`.

For reference, the final data dictionaries and files are:

``````cat dict/d1_g1.txt
``````
``````F1      factor  Factor 1
F2      factor  Factor 2
V1      num     Var 1
V2      num     Var 2
missing U
``````

``````\$ cat dict/d1_g2.txt
``````
``````SS      factor  Sleep stage
F1      factor  Factor 1
F2      factor  Factor 2
V1b     num     A new var 1
V1      alias   V1b
V9      int     More variables
``````

``````cat studies/id1/d1_g1_s1_F1_F2.txt
``````
``````ID      F1      F2      V1      V2
id1     A       X       1       2
id1     A       Y       3       4
id1     B       Y       5       U
``````

``````cat studies/id1/d1_g2_s1_F2_SS-N2.txt
``````
``````ID      V1      V9      V10     F2
id1     10      .       11      X
id1     NA      12      13      Y
id1     14      15      ?       Z
``````

``````cat studies/id2/d1_g1_s1_F1_F2.txt
``````
``````ID      F1      F2      V1      V2
id2     A       X       7       8
id2     B       Y       9       10
``````

We can look at the `s1.txt` file (here using Luna's `behead` utility to make it more human readable):

``````cat s1.txt | behead
``````
``````                       ID   id1
V1.F1_A_F2_X   1
V1.F1_A_F2_Y   3
V1.F1_B_F2_Y   5
V1B.F2_X_SS_N2   10
V1B.F2_Y_SS_N2   NA
V1B.F2_Z_SS_N2   14
V2.F1_A_F2_X   2
V2.F1_A_F2_Y   4
V2.F1_B_F2_Y   NA
V9.F2_X_SS_N2   NA
V9.F2_Y_SS_N2   12
V9.F2_Z_SS_N2   15

ID   id2
V1.F1_A_F2_X   7
V1.F1_A_F2_Y   NA
V1.F1_B_F2_Y   9
V1B.F2_X_SS_N2   NA
V1B.F2_Y_SS_N2   NA
V1B.F2_Z_SS_N2   NA
V2.F1_A_F2_X   8
V2.F1_A_F2_Y   NA
V2.F1_B_F2_Y   10
V9.F2_X_SS_N2   NA
V9.F2_Y_SS_N2   NA
V9.F2_Z_SS_N2   NA
``````

Note that `V10` is not present. This was noted in the console output above:

``````      variables [ V1B | V9 | V10 (skipped) ]
``````

i.e. as `V10` was not described in any data dictionary (or it was commented out, or the domain/group was skipped on the command line):

We can run `dmerge` in strict mode, where there must be an exact one-to-one matching between the contents of the data dictionaries and the contents of the data files, using the `-s` option:

``````dmerge dict studies s1.txt -s > s1.dict
``````
This would indeed yield the error:
``````*** error : V10 not specified in data-dictionary for studies/id1/d1_g2_s1_F2_SS-N2.txt
``````

## Generated data-dictionaries

`dmerge` generates a bespoke data dictionary to exactly match the output `s1.txt`. We saved it as `s1.dict`, it is a tab-delimited tabular file, e.g. that can easily be loaded into R, etc.

``````COL     VAR     BASE    OBS     DOMAIN  GROUP   TYPE    DESC    F1      F2      SS
0       F1      .       .       .       .       Factor  Factor 1        .       .       .
0       F2      .       .       .       .       Factor  Factor 2        .       .       .
0       SS      .       .       .       .       Factor  Sleep stage     .       .       .
1       ID      .       2       .       .       ID      Individual ID   .       .       .
2       V1.F1_A_F2_X    V1      2       d1      g1      Numeric Var 1 (F1=A, F2=X)      A       X       .
3       V1.F1_A_F2_Y    V1      1       d1      g1      Numeric Var 1 (F1=A, F2=Y)      A       Y       .
4       V1.F1_B_F2_Y    V1      2       d1      g1      Numeric Var 1 (F1=B, F2=Y)      B       Y       .
5       V1B.F2_X_SS_N2  V1B     1       d1      g2      Numeric Var 1 (F2=X, SS=N2)     .       X       N2
6       V1B.F2_Y_SS_N2  V1B     0       d1      g2      Numeric Var 1 (F2=Y, SS=N2)     .       Y       N2
7       V1B.F2_Z_SS_N2  V1B     1       d1      g2      Numeric Var 1 (F2=Z, SS=N2)     .       Z       N2
8       V2.F1_A_F2_X    V2      2       d1      g1      Numeric Var 2 (F1=A, F2=X)      A       X       .
9       V2.F1_A_F2_Y    V2      1       d1      g1      Numeric Var 2 (F1=A, F2=Y)      A       Y       .
10      V2.F1_B_F2_Y    V2      1       d1      g1      Numeric Var 2 (F1=B, F2=Y)      B       Y       .
11      V9.F2_X_SS_N2   V9      0       d1      g2      Integer More variables (F2=X, SS=N2)    .       X       N2
12      V9.F2_Y_SS_N2   V9      1       d1      g2      Integer More variables (F2=Y, SS=N2)    .       Y       N2
13      V9.F2_Z_SS_N2   V9      1       d1      g2      Integer More variables (F2=Z, SS=N2)    .       Z       N2
``````

Each column in `s1.txt` (`COL`) is described here, with the matching expanded variable names (`VAR`). The descriptions are also expanded, and columns in `s.dict` are added to correspond to factors from the data (`F1`, `F2`, etc) with values that match the level correspondiong to that variable. That is, variables are created by combining the base (`BASE`) along with the factor/level pairs, in the form: `BASE.FAC_LVL_FAC_LVL`

### Other options

We can check that expanded variables names do not get too long (e.g. if some stats programs have a hard limit, say 32 characters). Here we specify a max of 8 characters:

``````dmerge dict studies s1.txt -ml=8 > s1.dict
``````
`````` ** variable name exceeds 8 characters: V1.F1_A_F2_X
** variable name exceeds 8 characters: V1.F1_A_F2_Y
** variable name exceeds 8 characters: V1.F1_B_F2_Y
** variable name exceeds 8 characters: V1B.F2_X_SS_N2
** variable name exceeds 8 characters: V1B.F2_Y_SS_N2
** variable name exceeds 8 characters: V1B.F2_Z_SS_N2
** variable name exceeds 8 characters: V2.F1_A_F2_X
** variable name exceeds 8 characters: V2.F1_A_F2_Y
** variable name exceeds 8 characters: V2.F1_B_F2_Y
** variable name exceeds 8 characters: V9.F2_X_SS_N2
** variable name exceeds 8 characters: V9.F2_Y_SS_N2
** variable name exceeds 8 characters: V9.F2_Z_SS_N2

*** error : variables too long... options:
- change max. allowed length with -ml=999 option
- do not show factors with -nofac
- use aliases in data dictionaries
- or use numeric strata codes with -ns
``````

Of the suggestions above, one is to omit the factor names from expanded variable names, to make them shorter.

``````dmerge dict studies s1.txt -nofac > s1.dict
``````

``````V1.F1_A_F2_X
V1.F1_A_F2_Y
V1.F1_B_F2_Y
``````
for example, this now gives:
``````V1.A_X
V1.A_Y
V1.B_Y
``````

Alternatively, we can just encode strata numerically, with the `-ns` option:

``````dmerge dict studies s1.txt -ns > s1.dict
``````
in which case the (arbitrary) uniqifying numbers are as follows:
`````` V1.1
V1.3
V1.5
``````

Here, the codes may vary from run to run, so it would be important to match the particular data dictionary (`s1.dict`) with this file, i.e. so that the "meaning" of `V1.3` can be tracked.

### Working with generated data dictionaries

As noted, the generated data dictionary is designed to be easily analyzable, i.e. to use `s1.dict` hand-in-hand to help analyse the data in `s1.txt`. For example, loading it in R:

``````dd <- read.table( "s1.dict" , header=T , stringsAsFactors=F , sep="\t" )
``````
along with the actual data:
``````d <- read.table("s1.txt" , header=T, stringsAsFactors=F )
``````

One can imagine querying `dd` to pull out the expanded variable names (`dd\$VAR`) which match the column names in `d`. For example, to get the `V1` variable(s) where the `F1` is level `A`:

``````v <- dd\$VAR[ dd\$BASE == "V1" & d\$F1 == "A" ]
v
``````
`````` "V1.1" "V1.5"
``````
e.g. to pull out these columns:

``````d[ , v ]
``````
``````V1.1 V1.5
1    1    3
2    7   NA
``````

One can imagine writing simple convenience functions to assist with this: e.g.

``````fvars <- function( dd , bases = NULL , facs = NULL )
{
inc <- rep( T , dim(dd) )
if ( ! is.null( bases ) )
inc[ ! dd\$BASE %in% bases ] <- F
if ( is.list( facs ) )
for (f in names(facs) )
inc[ ! dd[,f] %in% facs[[f]] ] <- F
dd\$VAR[ inc ]
}
``````

All `V1` variables:

``````fvars( d , "V1" )
``````
`````` "V1.1" "V1.3" "V1.5"
``````

`V1` variables where `F1` is `A` and `F2` is `X` or `Y`:

``````fvars( d , "V1" , list( F1="A", F2=c("X","Y") ) )
``````
`````` "V1.1" "V1.5"
``````
etc. (note: in practice, such as convenience function might also give greater control over the and/or logic of selecting specific variables, this is just for illustration).