dmerge example
Below is a toy example of using dmerge, specifically designed to
illustrate some features and the design logic.  That is, for didactic
reasons, this example is written to not work "out of the box", but
rather to flag the types of things that might be encountered on real
data.
Data dictionaries
In this example we assume a single domain (just called d1) and
two groups (here g1 and g2).  The data dictionaries are under a
top-level folder dict/:
ls dict/
d1_g1.txt       d1_g2.txt
The first group g1 has two variables V1 and V2 along with two
stratifying factors (i.e. effectively defining the repeated measures
of those variables for each individual):
cat dict/d1_g1.txt
F1      factor  Factor 1
F2      factor  Factor 2
V1      num     Var 1
V2      num     Var 2
The second group, g2, also has a variable named V1, as well as V9, along with factors F1, F2 and SS:
cat dict/d1_g2.txt
SS      factor  Sleep stage
F1      factor  Factor 1
F2      factor  Different label
V1      num     Var 1
V9      int     More variables
Hint
Factors can be common across domains and groups, as these will contain common elements (channel, sleep stage, etc).
As we'll see below, variables (i.e. corresponding to the actual
information) are defined as being specific to a domain/group, and it
would be a problem to have the same label e.g. N for count appear in
different contexts (i.e. different domains/groups) if it means
different things (e.g. number of OSA events, number of arousals,
number of spindles, etc).  Thus, as we'll see, dmerge will spot
this potential problem.
The data
In this example we have two individuals, id1 and id2.  We require
that each individual has one (or more) subfolders under a root
project folder.
$ ls studies/
id1     id2
The first individual has two data files:
ls studies/id1
d1_g1_s1_F1_F2.txt      d1_g2_s1_SS-N2.txt
The second individual only has one:
ls studies/id2/
d1_g1_s1_F1_F2.txt
The filename convention tells us:
- which domain/group each dataset belongs to, i.e. d1_g1 or d1_g2
- the tag is just set to s1 in these examples, as there is only a single file per domain/group in this trivial example
- additional underscore-delimited terms reflect the factors that apply for that dataset, e.g. F1, F2
- in one case, a level value N2 is also ascribed to a factor (SS), implying that all rows should be assigned to this stratum, and that there is no column SS in the datafile itself; i.e. in real data, other files might imply different strata
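To make the convention concrete, here is a minimal Python sketch of how such a filename could be split into its parts. It is not dmerge's own code: the function and class names are hypothetical, and it assumes the first three underscore-delimited terms are always domain, group and tag, with a hyphen marking a factor whose level is fixed by the filename.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DataFileName:
    """Hypothetical container for the parts encoded in a data filename."""
    domain: str
    group: str
    tag: str
    factors: list = field(default_factory=list)       # factors expected as columns
    fixed_levels: dict = field(default_factory=dict)  # factor -> level set by the filename


def parse_datafile_name(path):
    """Split 'domain_group_tag[_FACTOR|_FACTOR-LEVEL ...].txt' into its parts."""
    parts = Path(path).stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"expected at least domain_group_tag in: {path}")
    info = DataFileName(domain=parts[0], group=parts[1], tag=parts[2])
    for term in parts[3:]:
        if "-" in term:
            # FACTOR-LEVEL: every row of the file is assigned to this one stratum
            factor, level = term.split("-", 1)
            info.fixed_levels[factor] = level
        else:
            # bare FACTOR: the file itself should carry a column of this name
            info.factors.append(term)
    return info


for name in ("studies/id1/d1_g1_s1_F1_F2.txt", "studies/id1/d1_g2_s1_SS-N2.txt"):
    print(name, "->", parse_datafile_name(name))
```

Under these assumptions, the first file is expected to contain columns F1 and F2, whereas the second has no factor columns and places all rows in the stratum SS=N2.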
Looking at the actual data:
$ cat studies/id1/d1_g1_s1_F1_F2.txt
ID      F1      F2      V1      V2
id1     A       X       1       2
id1     A       Y       3       4
id1     B       Y       5       U
Note here that V2 has a missing value coded with a nonstandard value,
U.  Certain values (., ?, NA, NaN, etc) are all treated as
missing but, given that V2 has a numeric type defined, this value
will cause a problem downstream, as we'll see.
$ cat studies/id1/d1_g2_s1_SS-N2.txt
ID      V1      V9      V10
id1     10      .       11
id1     NA      12      13
id1     14      15      ?
In the second data file for id1 (above), we have only standard missing value codes.  Note a
variable V10 that was not defined in the corresponding data
dictionary. Also note that V1 is present here, but has different
values from the first file.  Also note that in this case, there are no
stratifying variables included, but multiple (different) values of the
same variable for the same individual, which is also clearly a
problem (i.e. somebody forgot to include the relevant stratifiers in this output
to distinguish rows 1, 2 and 3).
Finally, here are the data for the second individual:
$ cat studies/id2/d1_g1_s1_F1_F2.txt
ID      F1      F2      V1      V2
id2     A       X       7       8
id2     B       Y       9       10
Running dmerge
Here we step through the runs needed to get a final result, to illustrate a couple of features of dmerge.
First run (namespaces of factors and variables)
This is the initial run of dmerge, which just points to the data dictionary folder (dict/) and the study data folder (studies/), and names the file in which the collated output will be created, s1.txt:
dmerge dict studies s1.txt
 ++ adding domain d1::g1 (4 variables)
 ++ adding domain d1::g2 (5 variables)
*** error : inconsistent label for factor F2 across data dictionaries
We get an error as dmerge notices that the F2 factor is defined differently (with different descriptions) across dictionaries.  In trying to harmonize data files, this is obviously a problem: which should be used?  This error therefore alerts the user to this issue, which has to be fixed.  Looking up the relevant lines (which might not be trivial in a large project), here with grep:
grep F2 dict/*
dict/d1_g1.txt:F2       factor  Factor 2
dict/d1_g2.txt:F2       factor  Different label
So, either in one or the other of these files, you must make the
description label identical, to enforce consistency across datafiles.
This avoids, for example, S meaning signal in one set of outputs,
but sleep stage in a second, which would lead to downstream
problems.
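In a larger project, the same check can be scripted rather than done factor-by-factor with grep. The following is a rough Python sketch, not part of dmerge: the function names are hypothetical, the dictionary contents are pasted in as strings to keep the example self-contained, and it assumes each dictionary line has the form name, type, description.

```python
from collections import defaultdict

# Dictionary contents as shown above (before any fixes)
DICTIONARIES = {
    "dict/d1_g1.txt": """\
F1      factor  Factor 1
F2      factor  Factor 2
V1      num     Var 1
V2      num     Var 2
""",
    "dict/d1_g2.txt": """\
SS      factor  Sleep stage
F1      factor  Factor 1
F2      factor  Different label
V1      num     Var 1
V9      int     More variables
""",
}


def parse_dictionary(text):
    """Yield (name, type, description) for each three-field dictionary line."""
    for line in text.splitlines():
        fields = line.split(None, 2)  # name, type, rest of line
        if len(fields) == 3:
            yield fields[0], fields[1], fields[2].strip()


def inconsistent_factors(dictionaries):
    """Return {factor: {file: description}} for factors whose descriptions differ."""
    labels = defaultdict(dict)
    for fname, text in dictionaries.items():
        for name, vtype, desc in parse_dictionary(text):
            if vtype == "factor":
                labels[name][fname] = desc
    return {f: by_file for f, by_file in labels.items()
            if len(set(by_file.values())) > 1}


for factor, by_file in inconsistent_factors(DICTIONARIES).items():
    for fname, desc in by_file.items():
        print(f"{factor}\t{fname}\t{desc}")
```

Only F2 is reported here: F1 has the same description in both dictionaries, and SS appears in only one.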
Second run (introducing aliases)
Trying again, we get a new error:
dmerge dict studies s1.txt
 ++ adding domain d1::g1 (4 variables)
 ++ adding domain d1::g2 (5 variables)
*** error : V1 is duplicated across data dictionaries
Looking up this variable across dictionaries:
grep V1 dict/*
dict/d1_g1.txt:V1       num     Var 1
dict/d1_g2.txt:V1       num     Var 1
As noted above, although the labels are identical, this is purposefully not allowed by the tool: variables are by definition specific to a domain and the names should be unique to a domain.
Hint
The same variable name can exist in different data files within the same domain/group.
e.g. PSD might exist in two files
eeg_spec_avg_B.txt
eeg_spec_avg_B_SS.txt
meaning that this measure is stratified by either band (B) or
by both band and sleep stage (SS).  The variable PSD would
only feature once in the data dictionary eeg_spec.txt (along
with factor definitions for B and SS), and the program
would correctly pull these together.
It would be burdensome to have to go back to the original files (which
may have been generated by different tools/people, and may be large and
not easy to edit, etc) and so dmerge allows for aliases to be
defined on the data dictionaries. For one of these domain/groups, we
can effectively relabel the variable V1 to something else when
harmonizing.   Here we edit dict/d1_g2.txt, to change the line:
V1     num   Var 1
to the following two lines:
V1b   num     A new var 1
V1    alias   V1b
That is, we first define a new variable V1b, which is specific to
this domain/group, and then define V1 as an alias for V1b.  This
means that for any d1_g2 datafile, any instance of V1 is treated
as if it were written V1b, and this avoids any potential naming
conflicts.
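The effect of the alias can be pictured as a simple renaming of data file columns before merging. The Python sketch below is only an illustration with hypothetical function names (it is not how dmerge is implemented); it assumes an alias line has the form alias name, the keyword alias, then the target variable, as in the edited dictionary above.

```python
# The edited dict/d1_g2.txt and the unchanged dict/d1_g1.txt, as strings
G1 = """\
F1      factor  Factor 1
F2      factor  Factor 2
V1      num     Var 1
V2      num     Var 2
"""

G2_EDITED = """\
SS      factor  Sleep stage
F1      factor  Factor 1
F2      factor  Factor 2
V1b     num     A new var 1
V1      alias   V1b
V9      int     More variables
"""


def parse_entries(dictionary_text):
    """Return (name, type, rest) triples from a whitespace-delimited dictionary."""
    entries = []
    for line in dictionary_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3:
            entries.append((fields[0], fields[1], fields[2].strip()))
    return entries


def read_aliases(dictionary_text):
    """Map each alias to the variable it stands for, e.g. {'V1': 'V1b'}."""
    return {name: target for name, vtype, target in parse_entries(dictionary_text)
            if vtype == "alias"}


def defined_variables(dictionary_text):
    """Names of true variables (everything that is neither a factor nor an alias)."""
    return {name for name, vtype, _ in parse_entries(dictionary_text)
            if vtype not in ("factor", "alias")}


def canonical_header(header, aliases):
    """Rename aliased columns of a data file header to their target names."""
    return [aliases.get(col, col) for col in header]


aliases = read_aliases(G2_EDITED)
print(canonical_header(["ID", "V1", "V9", "V10"], aliases))
print(defined_variables(G1) & defined_variables(G2_EDITED))  # names shared by both groups
```

After the edit, the two dictionaries no longer define any variable name in common, while the data file itself is left untouched.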
Third run (missing data codes)
Having fixed the above, we re-run:
dmerge dict studies s1.txt > s1.dict
 ++ adding domain d1::g1 (4 variables)
 ++ adding domain d1::g2 (5 variables)
 ++ read 3 rows from data-file studies/id2/d1_g1_s1_F1_F2.txt
      domain    [ d1 ]
      group     [ g1 ]
      file-tag  [ s1 ]
      variables [ V1 | V2 ]
      factors   [ F1 | F2 ]
*** error : invalid value [U] for V2 (type Numeric)
    in: studies/id1/d1_g1_s1_F1_F2.txt
As noted above, this illustrates the simple type-checking features,
spotting that U is not a valid numeric value.  If we know it is a
missing code used in the data file, we can add a line to the
dictionary dict/d1_g1.txt (here with just two tab-delimited columns):
  missing       U
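As a rough illustration of this kind of check (a sketch only, with hypothetical names, not dmerge's implementation), the Python function below validates a single value against its declared type. It assumes the dictionary types num and int seen above, and a set of standard missing codes limited to the four listed earlier; the tool's actual set may be larger.

```python
# Standard missing codes named in the text; the tool's full list may differ
STANDARD_MISSING = {".", "?", "NA", "NaN"}


def check_value(value, vtype, extra_missing=()):
    """Return None if missing, else the parsed value; raise ValueError if invalid."""
    if value in STANDARD_MISSING or value in extra_missing:
        return None
    try:
        if vtype == "int":
            return int(value)
        if vtype == "num":
            return float(value)
    except ValueError:
        raise ValueError(f"invalid value [{value}] for type {vtype}") from None
    return value  # other types are passed through unchecked in this sketch


print(check_value("5", "num"))                       # a valid numeric value
print(check_value("U", "num", extra_missing={"U"}))  # U declared as a missing code

try:
    check_value("U", "num")                          # U not declared: rejected
except ValueError as err:
    print("error:", err)
```

Declaring U in the dictionary thus turns a hard error into an ordinary missing value for that domain/group.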
Fourth run (multiple conflicting values / missing strata)
Running again, we now see a new error:
dmerge dict studies s1.txt
 ++ adding domain d1::g1 (4 variables)
 ++ adding domain d1::g2 (5 variables)
 ++ read 3 rows from data-file studies/id2/d1_g1_s1_F1_F2.txt
      domain    [ d1 ]
      group     [ g1 ]
      file-tag  [ s1 ]
      variables [ V1 | V2 ]
      factors   [ F1 | F2 ]
 ++ read 4 rows from data-file studies/id1/d1_g1_s1_F1_F2.txt
      domain    [ d1 ]
      group     [ g1 ]
      file-tag  [ s1 ]
      variables [ V1 | V2 ]
      factors   [ F1 | F2 ]
*** error : multiple values for id1 V9.SS_N2
This flags the issue we spotted above in the data file: multiple, discordant values of the same variable for the same individual. (The tool will also spot duplicate discordant values spread across multiple files.) In this instance, it is clear this is due to a stratifying factor not being included in the file:
cat studies/id1/d1_g2_s1_SS-N2.txt
ID      V1      V9      V10
id1     10      .       11
id1     NA      12      13
id1     14      15      ?
Suppose we go back and correct the original data (which would be necessary here), so that we instead have this:
cat studies/id1/d1_g2_s1_SS-N2.txt
ID      V1      V9      V10     F2
id1     10      .       11      X
id1     NA      12      13      Y
id1     14      15      ?       Z
i.e. we've made each row unique by adding the missing factor, F2,
and so there should be no conflicts now.  However, if we were to
re-run as is, we'd still get the same error.  Why?  This is because
the filename convention has not specified that F2 is a factor for
this file.
Note
Yes, that F2 is a factor could be inferred from
dict/d1_g2.txt, but the tool purposefully adopts the model that the
dictionaries must contain a complete representation of the truth
of the data, and also requires a second level of consistency (here
that filenames match). The design logic is that, at the cost of a
marginally more involved set-up, this makes things more robust
downstream, and less likely to produce subtle errors when merging
across different datafiles.
We therefore would need to also change the name of the datafile as
well as the contents to reflect the status of F2:
mv studies/id1/d1_g2_s1_SS-N2.txt studies/id1/d1_g2_s1_F2_SS-N2.txt
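To see why both the file contents and the filename matter, the following Python sketch mimics the conflict check under simplifying assumptions. The function name is hypothetical, only the four missing codes named earlier are recognised, and a conflict is defined as two different non-missing values for the same individual, variable and stratum. The factor lists passed in stand for what the filename declares.

```python
MISSING = {".", "?", "NA", "NaN"}  # assumption: only the codes named in the text


def find_conflicts(header, rows, factors, fixed_levels=None):
    """Return sorted (ID, expanded variable) pairs that receive discordant values.

    factors      : factor columns declared by the filename
    fixed_levels : factor -> level pairs fixed by the filename (e.g. SS-N2)
    """
    fixed_levels = fixed_levels or {}
    # Any column not declared as a factor is treated as a variable
    variables = [c for c in header if c != "ID" and c not in factors]
    seen, conflicts = {}, set()
    for row in rows:
        rec = dict(zip(header, row))
        levels = {f: rec[f] for f in factors}
        levels.update(fixed_levels)
        stratum = "_".join(f"{f}_{levels[f]}" for f in sorted(levels))
        for var in variables:
            value = rec[var]
            if value in MISSING:
                continue
            name = f"{var}.{stratum}" if stratum else var
            key = (rec["ID"], name)
            # setdefault stores the first value seen; later different values conflict
            if seen.setdefault(key, value) != value:
                conflicts.add(key)
    return sorted(conflicts)


header = ["ID", "V1", "V9", "V10", "F2"]
rows = [["id1", "10", ".", "11", "X"],
        ["id1", "NA", "12", "13", "Y"],
        ["id1", "14", "15", "?", "Z"]]

# Corrected contents, but old filename (F2 not declared as a factor)
print(find_conflicts(header, rows, factors=[], fixed_levels={"SS": "N2"}))
# Corrected contents and renamed file (F2 declared)
print(find_conflicts(header, rows, factors=["F2"], fixed_levels={"SS": "N2"}))
```

With F2 undeclared, every row still falls into the single stratum SS=N2 and the conflicts remain; once F2 is declared, each row has its own stratum and the list of conflicts is empty.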
Fifth run: validation
The fifth run will now work. Again, that it "failed" the first four times does not reflect problems with the tool; rather, think of it as giving feedback to enforce a set of conventions that help with data harmonization.
dmerge dict studies s1.txt > s1.dict
 ++ adding domain d1::g1 (4 variables)
 ++ adding domain d1::g2 (5 variables)
 ++ read 3 rows from data-file studies/id2/d1_g1_s1_F1_F2.txt
      domain    [ d1 ]
      group     [ g1 ]
      file-tag  [ s1 ]
      variables [ V1 | V2 ]
      factors   [ F1 | F2 ]
 ++ read 4 rows from data-file studies/id1/d1_g1_s1_F1_F2.txt
      domain    [ d1 ]
      group     [ g1 ]
      file-tag  [ s1 ]
      variables [ V1 | V2 ]
      factors   [ F1 | F2 ]
 ++ read 4 rows from data-file studies/id1/d1_g2_s1_F2_SS-N2.txt
      domain    [ d1 ]
      group     [ g2 ]
      file-tag  [ s1 ]
      variables [ V1B | V9 | V10 (skipped) ]
      factors   [ F2 | SS = N2 ]
finished: processed 2 individuals across 3 files, yielding 12 (expanded) variables
Here we've also saved the data dictionary to a file s1.dict as well as the actual data, in s1.txt.
For reference, the final data dictionaries and files are:
cat dict/d1_g1.txt
F1      factor  Factor 1
F2      factor  Factor 2
V1      num     Var 1
V2      num     Var 2
missing U
$ cat dict/d1_g2.txt
SS      factor  Sleep stage
F1      factor  Factor 1
F2      factor  Factor 2
V1b     num     A new var 1
V1      alias   V1b
V9      int     More variables
cat studies/id1/d1_g1_s1_F1_F2.txt
ID      F1      F2      V1      V2
id1     A       X       1       2
id1     A       Y       3       4
id1     B       Y       5       U
cat studies/id1/d1_g2_s1_F2_SS-N2.txt
ID      V1      V9      V10     F2
id1     10      .       11      X
id1     NA      12      13      Y
id1     14      15      ?       Z
cat studies/id2/d1_g1_s1_F1_F2.txt
ID      F1      F2      V1      V2
id2     A       X       7       8
id2     B       Y       9       10
We can look at the s1.txt file (here using Luna's behead utility to make it more human readable):
cat s1.txt | behead
                       ID   id1
             V1.F1_A_F2_X   1
             V1.F1_A_F2_Y   3
             V1.F1_B_F2_Y   5
           V1B.F2_X_SS_N2   10
           V1B.F2_Y_SS_N2   NA
           V1B.F2_Z_SS_N2   14
             V2.F1_A_F2_X   2
             V2.F1_A_F2_Y   4
             V2.F1_B_F2_Y   NA
            V9.F2_X_SS_N2   NA
            V9.F2_Y_SS_N2   12
            V9.F2_Z_SS_N2   15
                       ID   id2
             V1.F1_A_F2_X   7
             V1.F1_A_F2_Y   NA
             V1.F1_B_F2_Y   9
           V1B.F2_X_SS_N2   NA
           V1B.F2_Y_SS_N2   NA
           V1B.F2_Z_SS_N2   NA
             V2.F1_A_F2_X   8
             V2.F1_A_F2_Y   NA
             V2.F1_B_F2_Y   10
            V9.F2_X_SS_N2   NA
            V9.F2_Y_SS_N2   NA
            V9.F2_Z_SS_N2   NA
Note that V10 is not present.  This was noted in the console output above:
      variables [ V1B | V9 | V10 (skipped) ]
i.e. V10 was skipped as it was not described in any data dictionary
(this also happens if a variable is commented out, or if the
domain/group is skipped on the command line).
We can run dmerge in strict mode, where there must be an exact one-to-one matching between the contents of the data dictionaries and the contents of the data files, using the -s option:
dmerge dict studies s1.txt -s > s1.dict
*** error : V10 not specified in data-dictionary for studies/id1/d1_g2_s1_F2_SS-N2.txt
Generated data-dictionaries
dmerge generates a bespoke data dictionary to exactly match the
output s1.txt.  We saved it as s1.dict; it is a tab-delimited
tabular file that can easily be loaded into R, etc.
COL     VAR     BASE    OBS     DOMAIN  GROUP   TYPE    DESC    F1      F2      SS
0       F1      .       .       .       .       Factor  Factor 1        .       .       .
0       F2      .       .       .       .       Factor  Factor 2        .       .       .
0       SS      .       .       .       .       Factor  Sleep stage     .       .       .
1       ID      .       2       .       .       ID      Individual ID   .       .       .
2       V1.F1_A_F2_X    V1      2       d1      g1      Numeric Var 1 (F1=A, F2=X)      A       X       .
3       V1.F1_A_F2_Y    V1      1       d1      g1      Numeric Var 1 (F1=A, F2=Y)      A       Y       .
4       V1.F1_B_F2_Y    V1      2       d1      g1      Numeric Var 1 (F1=B, F2=Y)      B       Y       .
5       V1B.F2_X_SS_N2  V1B     1       d1      g2      Numeric Var 1 (F2=X, SS=N2)     .       X       N2
6       V1B.F2_Y_SS_N2  V1B     0       d1      g2      Numeric Var 1 (F2=Y, SS=N2)     .       Y       N2
7       V1B.F2_Z_SS_N2  V1B     1       d1      g2      Numeric Var 1 (F2=Z, SS=N2)     .       Z       N2
8       V2.F1_A_F2_X    V2      2       d1      g1      Numeric Var 2 (F1=A, F2=X)      A       X       .
9       V2.F1_A_F2_Y    V2      1       d1      g1      Numeric Var 2 (F1=A, F2=Y)      A       Y       .
10      V2.F1_B_F2_Y    V2      1       d1      g1      Numeric Var 2 (F1=B, F2=Y)      B       Y       .
11      V9.F2_X_SS_N2   V9      0       d1      g2      Integer More variables (F2=X, SS=N2)    .       X       N2
12      V9.F2_Y_SS_N2   V9      1       d1      g2      Integer More variables (F2=Y, SS=N2)    .       Y       N2
13      V9.F2_Z_SS_N2   V9      1       d1      g2      Integer More variables (F2=Z, SS=N2)    .       Z       N2
Each column in s1.txt (COL) is described here, with the matching
expanded variable names (VAR).  The descriptions are also
expanded, and columns in s1.dict are added to correspond to factors
from the data (F1, F2, etc) with values that match the level
corresponding to that variable.  That is, variables are created by
combining the base (BASE) along with the factor/level pairs, in
the form: BASE.FAC_LVL_FAC_LVL
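The naming rule can be written down in a few lines. This Python sketch uses a hypothetical function name and assumes that factors are ordered alphabetically within a name, which is consistent with the names in the table above but is not confirmed by it.

```python
def expanded_name(base, levels, show_factors=True):
    """Build BASE.FAC_LVL_FAC_LVL from a base name and a {factor: level} dict.

    Assumption: factors appear in alphabetical order. With show_factors=False,
    only the levels are kept (BASE.LVL_LVL), giving shorter names.
    """
    parts = []
    for factor in sorted(levels):
        if show_factors:
            parts.append(factor)
        parts.append(levels[factor])
    return base + "." + "_".join(parts) if parts else base


print(expanded_name("V1", {"F1": "A", "F2": "X"}))
print(expanded_name("V1B", {"SS": "N2", "F2": "Z"}))
print(expanded_name("V1", {"F1": "A", "F2": "X"}, show_factors=False))
```

Going the other way (from an expanded name back to its factors) is harder to do reliably if levels can themselves contain underscores, which is one reason the generated dictionary carries explicit BASE and factor columns.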
Other options
We can check that expanded variable names do not get too long (e.g. if some stats programs have a hard limit, say 32 characters). Here we specify a max of 8 characters:
dmerge dict studies s1.txt -ml=8 > s1.dict
 ** variable name exceeds 8 characters: V1.F1_A_F2_X
 ** variable name exceeds 8 characters: V1.F1_A_F2_Y
 ** variable name exceeds 8 characters: V1.F1_B_F2_Y
 ** variable name exceeds 8 characters: V1B.F2_X_SS_N2
 ** variable name exceeds 8 characters: V1B.F2_Y_SS_N2
 ** variable name exceeds 8 characters: V1B.F2_Z_SS_N2
 ** variable name exceeds 8 characters: V2.F1_A_F2_X
 ** variable name exceeds 8 characters: V2.F1_A_F2_Y
 ** variable name exceeds 8 characters: V2.F1_B_F2_Y
 ** variable name exceeds 8 characters: V9.F2_X_SS_N2
 ** variable name exceeds 8 characters: V9.F2_Y_SS_N2
 ** variable name exceeds 8 characters: V9.F2_Z_SS_N2
*** error : variables too long... options:
 - change max. allowed length with -ml=999 option
 - do not show factors with -nofac
 - use aliases in data dictionaries
 - or use numeric strata codes with -ns
Of the suggestions above, one is to omit the factor names from expanded variable names, to make them shorter.
dmerge dict studies s1.txt -nofac > s1.dict
Instead of:
V1.F1_A_F2_X
V1.F1_A_F2_Y
V1.F1_B_F2_Y
we would get:
V1.A_X
V1.A_Y
V1.B_Y
Alternatively, we can just encode strata numerically, with the -ns option:
dmerge dict studies s1.txt -ns > s1.dict
 V1.1
 V1.3
 V1.5
Here, the codes may vary from run to run, so it would be important to
match the particular data dictionary (s1.dict) with this file,
i.e. so that the "meaning" of V1.3 can be tracked.
Working with generated data dictionaries
As noted, the generated data dictionary is designed to be easily
analyzable, i.e. to use s1.dict hand-in-hand to help analyse the
data in s1.txt.  For example, loading it in R:
dd <- read.table( "s1.dict" , header=T , stringsAsFactors=F , sep="\t" ) 
d <- read.table("s1.txt" , header=T, stringsAsFactors=F ) 
One can imagine querying dd to pull out the expanded variable names (dd$VAR) which match the column names in d.  For example, to get the V1 variable(s) where the F1 is level A:
v <- dd$VAR[ dd$BASE == "V1" & dd$F1 == "A" ]
v
[1] "V1.1" "V1.5"
d[ , v ]
V1.1 V1.5
1    1    3
2    7   NA
One can imagine writing simple convenience functions to assist with this: e.g.
fvars <- function( dd , bases = NULL , facs = NULL )
{
 inc <- rep( T , dim(dd)[1] )
 if ( ! is.null( bases ) )
  inc[ ! dd$BASE %in% bases ] <- F
 if ( is.list( facs ) )
  for (f in names(facs) )
   inc[ ! dd[,f] %in% facs[[f]] ] <- F
 dd$VAR[ inc ]
}
All V1 variables:
fvars( dd , "V1" )
[1] "V1.1" "V1.3" "V1.5"
V1 variables where F1 is A and F2 is X or Y:
fvars( dd , "V1" , list( F1="A", F2=c("X","Y") ) )
[1] "V1.1" "V1.5"
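The same kind of lookup is not specific to R. As a sketch, here is an equivalent of fvars in Python using only the standard library. A few rows of the generated dictionary (with the default, non-numeric variable names) are embedded as a string so the example is self-contained; in practice one would read s1.dict from disk instead.

```python
import csv
import io

# A few rows of s1.dict, rebuilt as tab-delimited text
SAMPLE = "\n".join("\t".join(r) for r in [
    ["COL", "VAR", "BASE", "OBS", "DOMAIN", "GROUP", "TYPE", "DESC", "F1", "F2", "SS"],
    ["2", "V1.F1_A_F2_X", "V1", "2", "d1", "g1", "Numeric", "Var 1 (F1=A, F2=X)", "A", "X", "."],
    ["3", "V1.F1_A_F2_Y", "V1", "1", "d1", "g1", "Numeric", "Var 1 (F1=A, F2=Y)", "A", "Y", "."],
    ["4", "V1.F1_B_F2_Y", "V1", "2", "d1", "g1", "Numeric", "Var 1 (F1=B, F2=Y)", "B", "Y", "."],
    ["8", "V2.F1_A_F2_X", "V2", "2", "d1", "g1", "Numeric", "Var 2 (F1=A, F2=X)", "A", "X", "."],
])


def fvars(dict_text, bases=None, facs=None):
    """Return expanded variable names matching the given bases and factor levels.

    bases : collection of BASE names to keep (None keeps all)
    facs  : {factor: collection of allowed levels} (None applies no filter)
    """
    selected = []
    for row in csv.DictReader(io.StringIO(dict_text), delimiter="\t"):
        if bases is not None and row["BASE"] not in bases:
            continue
        if facs and any(row[f] not in allowed for f, allowed in facs.items()):
            continue
        selected.append(row["VAR"])
    return selected


print(fvars(SAMPLE, bases={"V1"}))
print(fvars(SAMPLE, bases={"V1"}, facs={"F1": {"A"}, "F2": {"X", "Y"}}))
```

The returned names can then be used to pick out the matching columns of s1.txt, exactly as with d[ , v ] in R.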