PLINK includes a set of utility options designed to help manage ID-related project data. In large projects, ID schemes can be difficult to manage. This set of options is aimed at scenarios in which individuals have been assigned multiple IDs, meaning that multiple lookup tables are needed to translate between schemes, although more basic tasks (e.g. joining multiple files based on a single shared ID) are also supported. In particular, these options will:

Combine multiple (partially overlapping) ID schemes
Spot inconsistencies
Track other (non-unique) attributes along with identifier information
Filter subsets of this database, for quick look-ups
Allow ID aliases
Allow individuals to be uniquely specified by two or more IDs, such as family ID and individual ID
Automatically collate and update ID schemes in external files
Merge multiple files based on multiple ID schemes

These functions are generic, in that they are not tied to any particular format or scheme of IDs used by PLINK. In fact, the "individuals" need not be samples, but could be anything, e.g. SNPs with RS numbers and vendor-specific codes. These options are aimed specifically at cases where ID data, along with limited amounts of secondary attributes (e.g. sex and age; or chromosome and map position in the case of SNPs), are stored in flat, rectangular text files.

Obviously there are many other ways to perform such tasks, for example, using any standard relational database, a perl script or Excel. Depending on your needs, you may or may not find the options implemented here quicker, easier or more reliable than some of these alternatives.

Example of usage

As an example: consider the following case, in which ID information is spread across four files: family and individual IDs from two sites, collab12.txt (site ID, family ID and individual ID)

     1    F00001  1
     1    F00001  2
     1    F00001  3
     1    F00002  1
     1    F00002  2
     1    F00002  3
     2    C101    P1
     2    C101    M2
     2    C101    C2

Similar information from a third site, collab3.txt, but with some additional information (date and sex) appended:

     3 1 F00001_1    3/12/09         F
     3 1 F00001_2    NA              NA
     3 1 F00001_3    3/17/09         M

Then we have a report back from the genotyping lab, geno.txt, on some of the samples (and which also includes some other samples)

     SITE FID       IID    GENO   PASS
     1    F00001      1    S001      Y
     1    F00001      2    S002      Y
     1    F00001      3    S003      Y
     1    F00002      2    S004      N
     1    F00002      3    S005      Y
     2    C101       P1    S006      Y
     2    C101       M2    S007      N
     2    C101       C2    S008      Y
     2    X1         X1    S009      Y

Finally, in followup.txt, we also have information on yet a further set of IDs assigned in a follow-up stage of the project, which are tied to the IDs assigned by the genotyping lab, rather than the original collaborator IDs:

     S001    fu_01_a
     S002    fu_01_b
     S003    fu_01_c
     S005    fu_01_d
     S006    fu_01_e
     S008    fu_01_f
     S009    fu_01_g

As described below, the following dictionary file (proj1.dict) is specified to track this information:

     collab12.txt  SITE FID IID           : joint=SITE,FID,IID
     collab3.txt   SITE FID IID DATE SEX  : attrib=DATE,SEX missing=NA
     geno.txt      SITE FID IID GENO PASS : attrib=PASS header
     followup.txt  GENO FUID

and the command

plink --id-dict proj1.dict

will collate all the files (after checking for inconsistencies, etc) into a single table, with missing values inserted where appropriate:

       DATE        FID        FUID     GENO          IID  PASS   SEX  SITE
          .     F00001     fu_01_a     S001            1     Y     .     1
          .     F00001     fu_01_b     S002            2     Y     .     1
          .     F00001     fu_01_c     S003            3     Y     .     1
          .     F00002           .        .            1     .     .     1
          .     F00002           .     S004            2     N     .     1
          .     F00002     fu_01_d     S005            3     Y     .     1
          .       C101     fu_01_e     S006           P1     Y     .     2
          .       C101           .     S007           M2     N     .     2
          .       C101     fu_01_f     S008           C2     Y     .     2
    3/12/09          1           .        .     F00001_1     .     F     3
          .          1           .        .     F00001_2     .     .     3
    3/17/09          1           .        .     F00001_3     .     M     3
          .         X1     fu_01_g     S009           X1     Y     .     2

There are then numerous commands that can search this database, and update or match external files based on any of the ID schemes. There is also a command for joining two or more files based on a single ID scheme, which does not require a dictionary/database to be specified. This could be of use, for example, to quickly line up partially overlapping output from PLINK based on SNP RS numbers.

Overview

The idea is that all data are kept in simple plain text files, and that the complete "master file" is then generated on-the-fly. This makes it easier to add and edit individual components of the ID database (i.e. the individual files).

Note In contrast to a full database, there is no support for hierarchical, relational data structures. That is, all observations in all tables must be of the same fundamental unit (e.g. a single individual).

Consider the case where we have three sets of IDs, labelled A, B and C, on up to four individuals. These are described across two files, id1.txt, which lists the A and B schemes (coded here for clarity to simply be a1, a2, etc)

     a1 b1
     a2 b2
     a3 b3
     a4 b4

and id2.txt, which contains the B and C codes for 3 individuals:

     b2 c2
     b1 c1
     b3 c3

For example, the individual labelled a1 under the A scheme is called b1 under the B scheme. Note that in id2.txt the individuals are in a different order and one individual (a4/b4) does not appear in the second file.

Importantly, all ID values and files should conform to the following:

values are delimited by 1 or more whitespace characters (tab or space)
one observation/individual per row/line; each line must have same number of fields
values cannot contain spaces, tabs, commas (,) or plus (+) characters
missing values must be explicitly indicated (by "." or another specified code, see below)

A dictionary file describing these ID tables would be as follows, e.g. in the file example.dict

     id1.txt A B
     id2.txt B C

The dictionary file lists each file in the database, followed by the field names in each. This dictionary thereby specifies that the second field in id1.txt should correspond with the first field in id2.txt as they both represent the B ID scheme. The dictionary file can also contain other commands, described below. Dictionaries can include full paths (i.e. database files can reside in different directories).

The basic command

plink --id-dict example.dict

will load all the ID data, check for consistency and generate the following in the LOG file

     ID helper, with dictionary [ example.dict ]
     Read 3 unique fields
     Reading [ id1.txt ] with fields : A, B
     Reading [ id2.txt ] with fields : B, C
     Writing output to [ plink.id ]
     4 unique records retrieved

The default behavior is to generate a file

     plink.id

that contains all the fields, with a header row included:

     A       B       C
     a1      b1      c1
     a2      b2      c2
     a3      b3      c3
     a4      b4      .

Because the last individual wasn't listed for the C field, a missing character (period/full stop ".") is entered.
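
As an illustration only, the collation step described above can be sketched in a few lines of Python. This is not PLINK's implementation: the function name collate is hypothetical, the contents of id1.txt are assumed from the output table, and the sketch neither checks for inconsistencies nor joins two records that only become linked by a later row.

```python
MISSING = "."  # missing-value code used in the output table

def collate(tables):
    """tables: list of (field_names, rows). Returns (sorted_fields, rows)."""
    records = []  # one dict per individual, mapping field -> value
    for fields, rows in tables:
        for row in rows:
            new = {f: v for f, v in zip(fields, row) if v != MISSING}
            # Reuse an existing record if it shares any field value with this row.
            match = next(
                (r for r in records if any(r.get(f) == v for f, v in new.items())),
                None,
            )
            if match is None:
                records.append(new)
            else:
                match.update(new)
    all_fields = sorted({f for fields, _ in tables for f in fields})
    return all_fields, [[r.get(f, MISSING) for f in all_fields] for r in records]

# Assumed contents of id1.txt and id2.txt
id1 = (["A", "B"], [["a1", "b1"], ["a2", "b2"], ["a3", "b3"], ["a4", "b4"]])
id2 = (["B", "C"], [["b2", "c2"], ["b1", "c1"], ["b3", "c3"]])

fields, rows = collate([id1, id2])
print("\t".join(fields))
for row in rows:
    print("\t".join(row))
```

The fourth individual has no C value in either file, so the sketch fills in the missing code for that column.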

Consistency checks

Imagine that one of the IDs had been entered incorrectly, for example if id2.txt has c2 repeated:

     b2 c2
     b1 c1
     b3 c2

PLINK would report this problem when loading the file, pointing out the inconsistency:

     *** Problems were detected in the ID lists:

     Two unique entries [ B = b2 and b3 ] that match elsewhere
      a) A=a2 B=b2 C=c2
      b) B=b3 C=c2

That is, PLINK has spotted that two entries are matched for the C field, but have different values for the B field. As these values are assumed to be unique identifiers, this is an inconsistency that must be fixed by the user. Inconsistencies across files or involving more than 2 ID fields can also be spotted.
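
The kind of check described above can be sketched as follows. This is a simplified illustration in Python, not PLINK's actual algorithm; the function name find_conflicts is hypothetical, and the records mirror the two entries shown in the report.

```python
from collections import defaultdict

def find_conflicts(records, id_fields):
    """Return (field, value, rec_a, rec_b) for every pair of records that
    share a value for one ID field but disagree on another ID field."""
    conflicts = []
    for field in id_fields:
        seen = defaultdict(list)  # value -> records carrying that value
        for rec in records:
            if field in rec:
                seen[rec[field]].append(rec)
        for value, recs in seen.items():
            for i in range(len(recs)):
                for j in range(i + 1, len(recs)):
                    a, b = recs[i], recs[j]
                    clash = [f for f in id_fields
                             if f in a and f in b and a[f] != b[f]]
                    if clash:
                        conflicts.append((field, value, a, b))
    return conflicts

records = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b2", "C": "c2"},
    {"B": "b3", "C": "c2"},  # c2 entered twice by mistake
]
conflicts = find_conflicts(records, ["A", "B", "C"])
for field, value, a, b in conflicts:
    print(f"Shared {field}={value} but different IDs elsewhere: {a} vs {b}")
```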

Attributes

In the example above, consider that id2.txt has been fixed, but that we now have a third file, id3.txt:

     a1 c1 M Wave1
     a2 c2 M Wave2
     a3 c3 F Wave2
     a4 c4 F Wave1

The third and fourth fields have non-unique values (e.g. M, for male, is repeated). In this example, this is because they contain information (attributes) that we want to track along with the sample IDs, but which is not an ID itself, i.e. the sex and source of the sample. It is possible to indicate that certain fields are to be treated not as identifiers (that, by definition, should be unique for each individual) but instead as attributes. The dictionary now reads:

     id1.txt A B
     id2.txt B C
     id3.txt A C Sex Source : attrib=Sex,Source

using the attrib= keyword after a colon : character to specify that the fields Sex and Source are attributes, not identifiers. This effectively means that duplicates are allowed, and that these values will not be considered when attempting to reconcile individuals across files.

Note All dictionary commands follow the filename and field headings; a colon character must come before any keyword; all items must be on the same line.

The LOG file now reads

     ID helper, with dictionary [ e.dict ]
     Read 5 unique fields
        Attribute fields: Sex Source
     Reading [ id1.txt ] with fields : A, B
     Reading [ id2.txt ] with fields : B, C
     Reading [ id3.txt ] with fields : A, C, Sex, Source

noting that Sex and Source are attributes. The output file plink.id now reads

      A       B       C     Sex    Source
     a1      b1      c1       M     Wave1
     a2      b2      c2       M     Wave2
     a3      b3      c3       F     Wave2
     a4      b4      c4       F     Wave1

Note that the columns are sorted in alphabetical order. Also note that we now see the fourth individual's value for the C field in this third file (c4) and so it is no longer missing.

Aliases

PLINK supports the use of aliases, where variant forms of an ID value are understood to map to the same individual. For example, an individual sample might have been sent for genotyping twice and received two distinct IDs that we really want to treat as referring to the same person.

Aliases can be specified in two ways: either by listing the same ID field twice (or more) in a file, or by entering a comma-delimited list of terms as a single value. For example, if the dictionary line is

     a.txt C C

and the file a.txt is

     c1 .
     c2 .
     c3 c3_w2

For the first two individuals, there are no aliases specified (as there is a missing value for the second field). For the third individual, this indicates that any instance of c3_w2 for the C field should be treated as an alias for c3.

Equivalently, the original id2.txt could simply be modified as follows:

     b2 c2
     b1 c1
     b3 c3,c3_w2

i.e. a comma-delimited list of two or more values indicates the additional values are aliases for the original value. Note that aliases must always be unique. The first value encountered is always the preferred value, to which aliases are converted.
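
A minimal Python sketch of the comma-delimited alias convention is given below. The helper names build_alias_map and resolve are hypothetical, and the sketch only illustrates the rule that the first value is preferred and that an alias must map to a single preferred value; it is not PLINK's code.

```python
def build_alias_map(values):
    """values: raw field values; 'c3,c3_w2' means c3_w2 is an alias of c3."""
    alias_to_pref = {}
    for raw in values:
        preferred, *aliases = raw.split(",")
        for alias in aliases:
            # An alias must be unique: it cannot point to two preferred values.
            if alias_to_pref.get(alias, preferred) != preferred:
                raise ValueError(f"alias {alias} maps to two preferred values")
            alias_to_pref[alias] = preferred
    return alias_to_pref

def resolve(value, alias_to_pref):
    """Convert an alias to its preferred value; other values pass through."""
    return alias_to_pref.get(value, value)

# C column of the modified id2.txt
aliases = build_alias_map(["c2", "c1", "c3,c3_w2"])
print(resolve("c3_w2", aliases))
print(resolve("c1", aliases))
```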

For example, if the file id3.txt was in fact,

     a1 c1    M Wave1
     a2 c2    M Wave2
     a3 c3_w2 F Wave2
     a4 c4    F Wave1

but the appropriate alias for c3 had been specified in one of the two ways mentioned above, PLINK should run correctly, automatically converting c3_w2 to c3 and producing the output file plink.id

     A       B       C       Sex     Source
     a1      b1      c1      M       Wave1
     a2      b2      c2      M       Wave2
     a3      b3      c3      F       Wave2
     a4      b4      c4      F       Wave1

Finally, the command --id-alias generates a file plink.id.eq that lists all aliases and the preferred value that are found in the database: e.g. (other aliases listed here just for illustration)

     FIELD     PREF      EQUIV
         C       c3      c3_w2
         C       c3         C3
         A       a1      ID-a1

Joint ID specification

An individual can be uniquely specified by a combination of two or more IDs instead of a single ID, for example, by a family ID and individual ID, or a project ID and an individual ID. This is represented in the dictionary as follows:

 
     id1.list  PROJ  FID  IID  : joint=FID,IID

Note, if a joint ID is specified, then all joint IDs must appear in subsequent files, e.g. a dictionary file that read as follows:

     id1.list  PROJ     FID   IID  : joint=FID,IID
     id2.list  CLIN_ID  IID

would give an error

     ERROR: Need to specify all joint fields in dictionary, [id2.list ]

A correct dictionary would read: (note, the order of the fields within the file is not important)

     id1.list  PROJ     FID   IID  : joint=FID,IID
     id2.list  CLIN_ID  IID   FID

This means that different individuals can share the same FID, for example:

     FID     IID
     F0001   1
     F0001   2
     F0002   1
     F0002   2

now denote four unique individuals.
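
The effect of a joint ID can be illustrated with a short Python sketch (the function name count_individuals is hypothetical): keying on FID or IID alone collapses the four rows into two, whereas keying on the pair keeps all four distinct.

```python
rows = [
    ("F0001", "1"),
    ("F0001", "2"),
    ("F0002", "1"),
    ("F0002", "2"),
]

def count_individuals(rows, key_columns):
    """Count distinct keys built from the given column positions."""
    return len({tuple(row[i] for i in key_columns) for row in rows})

print(count_individuals(rows, [0]))     # FID alone
print(count_individuals(rows, [1]))     # IID alone
print(count_individuals(rows, [0, 1]))  # joint FID,IID
```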

NOTE You can create joint IDs containing more than two fields, e.g. joint=X,Y,Z. The order of the joint fields does not need to be the same in all files. Also, you only need to specify the "joint=X,Y,.." command once in the dictionary. Finally, you can also have multiple joint fields:

 
     id1.list  SITE   PROJ   FID        IID  : joint=FID,IID joint=SITE,PROJ
     id2.list  FID    IID    CLIN_ID
     id3.list  SITE   PROJ   RECRUIT_ID

HINT The set:field=value command, described below, can be used to create joint IDs. This can be useful to ensure no accidental overlap of ID schemes between files from different sources. See below for an example.

Filtering / lookup options

It is possible to restrict the output to certain rows or columns of the total database. For example, to only output fields C and Sex, add the command

     --id-table C,Sex

To lookup all fields on a particular individual, e.g. with a given ID value for the B ID scheme, use the command

     --id-lookup B=b2

This prints a message to the LOG indicating that a lookup is being performed

     Lookup up items matching:
       B = b2  (id)

and the output file now only contains a single row

      A      B      C   Sex    Source
     a2     b2     c2     M     Wave2

It is possible to lookup an individual based on an alias, e.g. in the example above,

     --id-lookup C=c3_w2

produces the output in the LOG

     Lookup up items matching:
       C = c3  (id)

indicating that the query term alias has been replaced with the preferred value, and the output is

      A      B      C   Sex    Source
     a3     b3     c3     F     Wave2

Lookups can also be based on attributes and involve multiple fields, in which case the row must match all the specified field values:

        --id-lookup Sex=M,Source=Wave2

for example

     Looking up items matching: 
       Sex = M  (attribute)
       Source = Wave2  (attribute)
     Writing output to [ plink.id ]
     1 unique records retrieved

and the output in plink.id is

      A      B      C   Sex    Source
     a2     b2     c2     M     Wave2

NOTE It is not currently possible to specify ranges of numerical values (e.g. less than 10) or wildcards, (e.g. Wave*) when performing --id-lookup.
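
The matching rule used by lookups (a row must equal every requested field value exactly) can be sketched in Python as follows. The function name lookup is hypothetical, alias conversion is left out, and the table is the example plink.id shown earlier.

```python
table = [
    {"A": "a1", "B": "b1", "C": "c1", "Sex": "M", "Source": "Wave1"},
    {"A": "a2", "B": "b2", "C": "c2", "Sex": "M", "Source": "Wave2"},
    {"A": "a3", "B": "b3", "C": "c3", "Sex": "F", "Source": "Wave2"},
    {"A": "a4", "B": "b4", "C": "c4", "Sex": "F", "Source": "Wave1"},
]

def lookup(table, query):
    """query: comma-separated field=value pairs, all of which must match."""
    conditions = dict(item.split("=", 1) for item in query.split(","))
    return [
        row for row in table
        if all(row.get(field) == value for field, value in conditions.items())
    ]

print(lookup(table, "B=b2"))
print(lookup(table, "Sex=M,Source=Wave2"))
```

Because the comparison is a plain equality test, ranges and wildcards are naturally outside what this sketch (like the command it illustrates) can express.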

Replace ID schemes in external files

The --id-replace command takes three fixed arguments, possibly followed by additional options:

     --id-replace  file  old-ID  new-ID  {options}

This command uses the information specified in the dictionary to read in an external file (i.e. one not specified in the dictionary) and replace or update the IDs as requested. Consider the data file mydata.dat:

      A   v1 v2 v3 v4   v5
     a1    0  0  1  1 0.23
     a3    1  1  0  1 0.35
     a5    0  0  0  1 0.54

Then the command

plink --id-dict ex.dict --id-replace mydata.dat A C header

will look up the value for A in mydata.dat, using the fact that this file has a header row, and replace it, if possible, with the value for C for that person. This prints the following in the LOG:

    Replacing A with C from [ mydata.dat ]
    Writing new file to [ plink.rep ]
    Set to keep original value for unmatched observations
    Could not find matches for 1 lines

The file plink.rep contains the updated file:

    C  v1  v2  v3  v4  v5
    c1  0  0  1  1  0.23
    c3  1  1  0  1  0.35
    a5  0  0  0  1  0.54

The last line did not match any entry in the database (a5) and so, by default, it is left as is. Otherwise, the appropriate C IDs have been swapped in for the other two individuals, and the header has been changed.

To change the default behavior when a non-matching individual is encountered, use one of the following options: warn, skip, miss or list. For example,

plink --id-dict ex.dict --id-replace mydata.dat A C header warn

will produce an error in the LOG file

     ERROR: Could not find replacement for a5

and not proceed any further. The option

plink --id-dict ex.dict --id-replace mydata.dat A C header skip

will simply ignore that line, not printing it in plink.rep which will now read

    C  v1  v2  v3  v4  v5
    c1  0  0  1  1  0.23
    c3  1  1  0  1  0.35

The option

plink --id-dict ex.dict --id-replace mydata.dat A C header miss

will replace the non-matching ID with the missing code NA,

    C  v1  v2  v3  v4  v5
    c1  0  0  1  1  0.23
    c3  1  1  0  1  0.35
    NA  0  0  0  1  0.54

Finally, the option

plink --id-dict ex.dict --id-replace mydata.dat A C header list

will list in plink.rep any individual that did not match: in this case, it will just list

a5
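
The five behaviours for unmatched IDs can be summarised in a Python sketch. The function replace_ids is hypothetical, and the label "keep" for the default behaviour is an assumption made here for illustration; only warn, skip, miss and list are option names given in this document.

```python
MODES = {"keep", "warn", "skip", "miss", "list"}

def replace_ids(rows, mapping, mode="keep", missing="NA"):
    """rows: lists with the old ID in column 0; mapping: old ID -> new ID."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    out, unmatched = [], []
    for row in rows:
        old_id, rest = row[0], list(row[1:])
        if old_id in mapping:
            out.append([mapping[old_id]] + rest)
            continue
        unmatched.append(old_id)
        if mode == "warn":
            raise LookupError(f"Could not find replacement for {old_id}")
        if mode == "keep":
            out.append([old_id] + rest)   # leave the original value in place
        elif mode == "miss":
            out.append([missing] + rest)  # substitute the missing code
        # "skip" and "list" write no data row for an unmatched ID
    return unmatched if mode == "list" else out

data = [
    ["a1", "0", "0", "1", "1", "0.23"],
    ["a3", "1", "1", "0", "1", "0.35"],
    ["a5", "0", "0", "0", "1", "0.54"],
]
a_to_c = {"a1": "c1", "a2": "c2", "a3": "c3", "a4": "c4"}

for mode in ("keep", "skip", "miss", "list"):
    print(mode, replace_ids(data, a_to_c, mode))
```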

It is possible to combine both aliases (in the target file) and joint IDs (as both the target and replacement ID) with the --id-replace function. This is specified by use of the plus "+" symbol, e.g.

plink --id-dict ex.dict --id-replace mydata2.dat GENOID FID+IID header

will replace the single entry of GENOID with the two values for FID and IID.

Finally, if the file does not contain a header row, use the field option:

plink --id-dict ex.dict --id-replace mydata.dat A C field=1

which tells PLINK that column 1 of mydata.dat contains the A field. If the target ID is a joint ID, the same notation can be used in this case:

plink --id-dict ex.dict --id-replace mydata2.dat FID+IID GENOID field=2+5

for example, to indicate that FID is in column 2 and IID is in column 5. In this case, column 5 will be printed as blank, and so effectively skipped. When the replacing ID is a joint ID, all joint values replace the first matched field, i.e. in this case they would have been inserted as columns 2, 3, etc., if the replacement field had in fact been a joint ID rather than just GENOID.

Match multiple files based on IDs

This option takes an index file and one or more other files and sorts these files to match the order of the index file (inserting blank rows if needed, or dropping rows if they are not present in the index file, as specified), using IDs as defined in the dictionary, in the format


     --id-match {file} {ID}  {file} {ID}  {file} {ID} ...  { + options }

where each file is followed by the ID used to match it, repeated for as many files as are to be matched. For example,

plink --id-dict ex.dict --id-match dat1.dat A,1 dat2.txt C dat3.txt C

would generate a new file

     plink.match

that lines up the rows in dat2.txt and dat3.txt to match dat1.dat, using the ID database specified by ex.dict. The IDs are specified as follows:

     A       Field A, assume header exists and contains A
     A,2     Field A, 2nd column of file, assume no header
     A+B     Joint ID A and B, assume header exists
     A+B,2+3 Joint ID A and B, in 2nd and 3rd columns, no header

Therefore, the above implies that dat1.dat does not contain a header row, but the other two files do. That is, by specifying a number following a comma, we implicitly tell PLINK both that no header exists, and which column to look in. Otherwise we assume the header should contain the named field (an error will be reported otherwise). In all cases the files to be matched must be rectangular, i.e. having the same number of whitespace-delimited fields.

To print only the rows that are present in all files, add the option complete as follows:

   --id-match f1.txt ID f2.txt ID + complete

Otherwise by default, missing values are printed when the data are not present in one of the files.
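
The row alignment performed by a match can be sketched in Python for the simplest case of one index file and one other file that share an ID scheme. The function match_to_index, the example rows and the use of "." as the missing code are all assumptions for illustration, not PLINK's implementation or output format.

```python
def match_to_index(index_rows, other_rows, complete=False, missing="."):
    """Reorder other_rows to follow index_rows; column 0 holds the ID."""
    by_id = {row[0]: list(row[1:]) for row in other_rows}
    width = len(other_rows[0]) - 1 if other_rows else 0
    merged = []
    for row in index_rows:
        extra = by_id.get(row[0])
        if extra is None:
            if complete:
                continue  # keep only rows present in every file
            extra = [missing] * width
        merged.append(list(row) + extra)
    return merged

# Hypothetical data: the second file is in a different order and has an
# extra individual (a9) that is absent from the index file.
index_rows = [["a1", "0.23"], ["a3", "0.35"], ["a5", "0.54"]]
other_rows = [["a3", "F"], ["a1", "M"], ["a9", "M"]]

print(match_to_index(index_rows, other_rows))
print(match_to_index(index_rows, other_rows, complete=True))
```

Rows that appear only in the non-index file (a9 here) are dropped, since the index file alone determines which rows are printed and in what order.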

NOTE Any individuals not found in the database are listed in a file named plink.noid, and a message is printed in the LOG file.

Quick match multiple files based on IDs, without a dictionary

If the --id-match command is used without specifying a data dictionary, i.e. there is no --id-dict, then we assume a simple correspondence of ID schemes between files. This can provide a quick way to join up rectangular text files based on a common key, e.g.

./plink --id-match f1.txt ID f2.txt ID,2 f3.txt IID

Note: when a field position is specified, it does not matter what the field is named (as there is no database to look it up in, in any case). Similarly, the ID field may have a different name in some files, e.g. IID not ID in f3.txt. Importantly, however, we assume the specific entries in these files all come from the same ID scheme, i.e. otherwise a dictionary should be specified to map between schemes.

Miscellaneous

The dictionary file can specify whether the file has a header row by adding the keyword header in the dictionary. The missing= keyword can also be used to specify one or more missing value codes, that are specific to that file.

   id1.list A B : header
   id2.list B C 
   id3.list C D : attrib=D header missing=NA,-9

The `set` command

For an attribute, or part of a joint ID, it is possible to use the set command to specify that all individuals in that file have a particular ID value inserted. This can be useful, for example, if samples from several sources are being grouped, and one wants to ensure no accidental overlap between samples: e.g. if one site sends a file site1.txt with individuals

and another site sends a similar file, site2.txt, that refers to three different individuals

     1
     2 
     3

the dictionary ex2.dict could read

     site1.txt ID : set:SITE=1 joint=ID,SITE header
     site2.txt ID : set:SITE=2

then

plink --id-dict ex2.dict

will produce a file plink.id that reads

Note the specific format, with a colon and equals sign but no spaces:

     set:field=value
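
To make the dictionary line format concrete, here is a minimal Python sketch of a parser for the keywords described in this document (header, attrib=, missing=, joint= and set:field=value). The function parse_dict_line and the layout of the returned dictionary are hypothetical; the sketch only illustrates the rule that the filename and field names come first and that keywords follow a lone colon.

```python
def parse_dict_line(line):
    """Split one dictionary line into its filename, fields and keywords."""
    tokens = line.split()
    if ":" in tokens:  # a lone colon separates fields from keywords
        cut = tokens.index(":")
        head, keywords = tokens[:cut], tokens[cut + 1:]
    else:
        head, keywords = tokens, []
    entry = {"file": head[0], "fields": head[1:], "header": False,
             "attrib": [], "missing": [], "joint": [], "set": {}}
    for kw in keywords:
        if kw == "header":
            entry["header"] = True
        elif kw.startswith("attrib="):
            entry["attrib"] = kw[len("attrib="):].split(",")
        elif kw.startswith("missing="):
            entry["missing"] = kw[len("missing="):].split(",")
        elif kw.startswith("joint="):
            entry["joint"].append(kw[len("joint="):].split(","))
        elif kw.startswith("set:"):
            field, _, value = kw[len("set:"):].partition("=")
            entry["set"][field] = value
        else:
            raise ValueError(f"unknown keyword: {kw}")
    return entry

print(parse_dict_line("id3.list C D : attrib=D header missing=NA,-9"))
print(parse_dict_line("site1.txt ID : set:SITE=1 joint=ID,SITE header"))
```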

List all instances of an ID across files

To get a list of all instances of an ID value across multiple files, use the command

plink --id-dict ex.dict --id-dump A=a1

which will list to the LOG file

     Reporting rows that match [ A=a1 ]

     id1.txt : A = a1
     id1.txt : B = b1

     id3.txt : A = a1
     id3.txt : C = c1
     id3.txt : Sex = M
     id3.txt : Source = Wave1

This can be useful in tracking down where incorrect IDs are located across multiple files, for example, in order to manually resolve inconsistencies, etc.
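
The same kind of report can be sketched in Python. The function dump and the in-memory representation of the files are hypothetical; the sketch simply scans every file that contains the queried field and lists all fields of each matching row.

```python
def dump(files, field, value):
    """files: {filename: (field_names, rows)}. Returns report lines."""
    lines = []
    for name, (fields, rows) in files.items():
        if field not in fields:
            continue  # this file does not carry the queried ID scheme
        col = fields.index(field)
        for row in rows:
            if row[col] == value:
                lines.extend(f"{name} : {f} = {v}" for f, v in zip(fields, row))
    return lines

files = {
    "id1.txt": (["A", "B"], [["a1", "b1"], ["a2", "b2"]]),
    "id2.txt": (["B", "C"], [["b2", "c2"], ["b1", "c1"]]),
    "id3.txt": (["A", "C", "Sex", "Source"],
                [["a1", "c1", "M", "Wave1"], ["a2", "c2", "M", "Wave2"]]),
}
report = dump(files, "A", "a1")
print("\n".join(report))
```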

This document last modified Wednesday, 25-Jan-2017 11:39:27 EST
