ID helper

PLINK includes a set of utility options designed to help manage
ID-related project data. In large projects, ID schemes can be
difficult to manage. This set of options is aimed at scenarios in
which individuals have been assigned multiple IDs, meaning that
multiple lookup tables are needed to translate between schemes,
although more basic tasks (e.g. joining multiple files based on a
single shared ID) are also supported. In particular, these options will:
  Combine multiple (partially overlapping) ID schemes
  Spot inconsistencies
  Track other (non-unique) attributes along with identifier information
  Filter subsets of this database, for quick look-ups
  Allow ID aliases
  Allow individuals to be uniquely specified by two or more IDs, such as family ID and individual ID
  Automatically collate and update ID schemes in external files
  Merge multiple files based on multiple ID schemes

These functions are generic, in that they are not tied to any
particular format or scheme of IDs used by PLINK. In fact, the
"individuals" need not be samples, but could be anything, e.g. SNPs
with RS numbers and vendor-specific codes. These options are
specifically aimed at cases where ID data, along with limited amounts
of secondary attributes (e.g. sex and age; or chromosome and map
position in the case of SNPs) are stored in flat, rectangular
text files.

Obviously there are many other ways to perform such tasks, for
example, using any standard relational database, a Perl script or
Excel. Depending on your needs, you may or may not find the options
implemented here quicker, easier or more reliable than some of these
alternatives.

Example of usage

As an example, consider the following case, in which ID information is
spread across four files. The first, collab12.txt, lists family and
individual IDs from two sites (site ID, family ID and individual ID):
     1    F00001  1
     1    F00001  2
     1    F00001  3
     1    F00002  1
     1    F00002  2
     1    F00002  3
     2    C101    P1
     2    C101    M2
     2    C101    C2
Similar information from a third site (collab3.txt), but with some additional information appended (date and sex):
     3 1 F00001_1    3/12/09         F
     3 1 F00001_2    NA              NA
     3 1 F00001_3    3/17/09         M
Then we have a report back from the genotyping lab (geno.txt) on some of the
samples, which also includes some other samples:
     SITE FID       IID    GENO   PASS
     1    F00001      1    S001      Y
     1    F00001      2    S002      Y
     1    F00001      3    S003      Y
     1    F00002      2    S004      N
     1    F00002      3    S005      Y
     2    C101       P1    S006      Y
     2    C101       M2    S007      N
     2    C101       C2    S008      Y
     2    X1         X1    S009      Y
Finally, followup.txt holds a further set of IDs assigned
in a follow-up stage of the project; these are tied to the IDs assigned
by the genotyping lab, rather than to the original collaborator IDs:
     S001    fu_01_a
     S002    fu_01_b
     S003    fu_01_c
     S005    fu_01_d
     S006    fu_01_e
     S008    fu_01_f
     S009    fu_01_g
As described below, the following dictionary file
(proj1.dict) is specified to track this information:
     collab12.txt  SITE FID IID           : joint=SITE,FID,IID
     collab3.txt   SITE FID IID DATE SEX  : attrib=DATE,SEX missing=NA
     geno.txt      SITE FID IID GENO PASS : attrib=PASS header
     followup.txt  GENO FUID
and the command
 plink --id-dict proj1.dict 
will collate all the files (after checking for inconsistencies, etc) into 
a single table, with missing values inserted where appropriate:
       DATE        FID        FUID     GENO          IID  PASS   SEX  SITE
          .     F00001     fu_01_a     S001            1     Y     .     1
          .     F00001     fu_01_b     S002            2     Y     .     1
          .     F00001     fu_01_c     S003            3     Y     .     1
          .     F00002           .        .            1     .     .     1
          .     F00002           .     S004            2     N     .     1
          .     F00002     fu_01_d     S005            3     Y     .     1
          .       C101     fu_01_e     S006           P1     Y     .     2
          .       C101           .     S007           M2     N     .     2
          .       C101     fu_01_f     S008           C2     Y     .     2
    3/12/09          1           .        .     F00001_1     .     F     3
          .          1           .        .     F00001_2     .     .     3
    3/17/09          1           .        .     F00001_3     .     M     3
          .         X1     fu_01_g     S009           X1     Y     .     2
There are then numerous commands that can search this database, and
update or match external files based on any of the ID schemes.  There
is also a command for joining two or more files based on a single ID
scheme, which does not require a dictionary/database to be
specified. This could be of use, for example, to quickly line up
partially overlapping output from PLINK, based on SNP RS numbers, for
example.

Overview

The idea is that all data are kept in simple plain text files, and
that the complete "master file" is then generated on-the-fly. This
makes it easier to add and edit individual components of the ID
database (i.e. the individual files).
Note In contrast to a full database, there is no
support for hierarchical, relational data structures. That is, all
observations in all tables must be of the same fundamental unit
(e.g. a single individual).
Suppose we have three sets of IDs, labelled A, B and C, on up to four
individuals. These are described across two files: id1.txt, which
lists the A and B schemes (coded here for clarity simply as a1, a2, etc.)
     a1 b1
     a2 b2
     a3 b3
     a4 b4
and id2.txt, which contains the B and C codes for 3 individuals:
     b2 c2
     b1 c1
     b3 c3
For example, the individual labelled a1 under the A scheme is
called b1 under the B scheme.  Note that in id2.txt
the individuals are in a different order and one individual (a4/b4)
does not appear in the second file.
Importantly, all ID values and files should conform to the following:
  values are delimited by 1 or more whitespace characters (tab or space)
  one observation/individual per row/line; each line must have the same number of fields
  values cannot contain spaces, tabs, commas (,) or plus (+) characters
  missing values must be explicitly indicated (by "." or another specified code, see below)

A dictionary file describing these ID tables would be as
follows, e.g. in the file example.dict:
     id1.txt A B
     id2.txt B C
The dictionary file lists each file in the database, followed by the
field names in each. This dictionary thereby specifies that the second
field in id1.txt should correspond with the first field
in id2.txt as they both represent the B ID scheme. The dictionary file
can also contain other commands, described below. Dictionaries can include full paths
(i.e. database files can reside in different directories).
The basic command
plink --id-dict example.dict
will load all the ID data, check for consistency and generate the
following in the LOG file
     ID helper, with dictionary [ example.dict ]
     Read 3 unique fields
     Reading [ id1.txt ] with fields : A, B
     Reading [ id2.txt ] with fields : B, C
     Writing output to [ plink.id ]
     4 unique records retrieved
The default behavior is to generate a file
     plink.id
that contains all the fields, with a header row included:
     A       B       C
     a1      b1      c1
     a2      b2      c2
     a3      b3      c3
     a4      b4      .
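The collation that produces this table can be pictured with a minimal Python sketch. It is not PLINK's implementation: the `collate` helper and its simple record-matching strategy are assumptions made for illustration, every field is treated as an identifier, and the two tables are held in memory rather than read from id1.txt and id2.txt.

```python
# Illustrative sketch only: collate ID tables that share fields.
# `collate` is a hypothetical helper, not part of PLINK.

def collate(tables, missing="."):
    """tables: list of (field_names, rows). Returns (sorted_fields, rows)."""
    records = []  # each record is a dict mapping field -> value
    for fields, rows in tables:
        for row in rows:
            new = dict(zip(fields, row))
            # Find an existing record that shares any (field, value) pair.
            match = next((r for r in records
                          if any(r.get(f) == v for f, v in new.items())), None)
            if match is None:
                records.append(new)
            else:
                # No conflict checking here; a real tool would verify consistency.
                match.update(new)
    all_fields = sorted({f for fields, _ in tables for f in fields})
    return all_fields, [[r.get(f, missing) for f in all_fields] for r in records]


id1 = (["A", "B"], [["a1", "b1"], ["a2", "b2"], ["a3", "b3"], ["a4", "b4"]])
id2 = (["B", "C"], [["b2", "c2"], ["b1", "c1"], ["b3", "c3"]])

fields, rows = collate([id1, id2])
print("\t".join(fields))
for row in rows:
    print("\t".join(row))
```

The sketch keeps the first file's row order and fills any field a record never received with the missing code.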
Because the last individual wasn't listed for the C field, a missing
character (period/full stop ".") is entered.

Consistency checks

Imagine that one of the IDs had been entered incorrectly, for example
if id2.txt has c2 repeated:
     b2 c2
     b1 c1
     b3 c2
PLINK would report this problem when loading the file, pointing out the inconsistency:
     *** Problems were detected in the ID lists:
     Two unique entries [ B = b2 and b3 ] that match elsewhere
      a) A=a2 B=b2 C=c2
      b) B=b3 C=c2
That is, PLINK has spotted that two entries are matched for the C
field, but have different values for the B field. As these values are
assumed to be unique identifiers, this is an inconsistency that must
be fixed by the user. Inconsistencies across files or involving more
than 2 ID fields can also be spotted.

Attributes

In the example above, consider that id2.txt has been fixed,
but that we now have a third file, id3.txt:
     a1 c1 M Wave1
     a2 c2 M Wave2
     a3 c3 F Wave2
     a4 c4 F Wave1
The third and fourth fields have non-unique values (e.g. M, for male,
is repeated). In this example, this is because they contain
information (attributes) that we want to track along with the sample
IDs, but which is not an ID itself, i.e. the sex and source of the
sample. It is possible to indicate that certain fields are to be
treated not as identifiers (which, by definition, should be
unique for each individual) but instead as attributes. The
dictionary now reads:
     id1.txt A B
     id2.txt B C
     id3.txt A C Sex Source : attrib=Sex,Source
using the attrib= keyword after a colon (:) character
to specify that the fields Sex and Source are
attributes, not identifiers. This effectively means that duplicates
are allowed, and that these values will not be considered when attempting 
to reconcile individuals across files.
Note All dictionary commands follow the filename and
field headings; a colon character must come before any keyword; all
items must be on the same line.
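The line layout described in this note (filename, field headings, then an optional colon followed by keywords) can be illustrated with a short Python sketch. The `parse_dict_line` helper is hypothetical and is not PLINK's parser; it recognises only the attrib=, joint=, missing= and header keywords that appear in the dictionaries shown so far, and collects anything else untouched.

```python
# Illustrative sketch only: split one dictionary line into its parts.
# `parse_dict_line` is a hypothetical helper, not part of PLINK.

def parse_dict_line(line):
    """Return (filename, fields, options) for one dictionary line."""
    left, _, right = line.partition(":")      # keywords follow the first colon
    filename, *fields = left.split()
    options = {"attrib": [], "joint": [], "missing": [], "header": False, "other": []}
    for keyword in right.split():
        name, eq, values = keyword.partition("=")
        if keyword == "header":
            options["header"] = True
        elif eq and name == "joint":
            options["joint"].append(values.split(","))   # one list per joint ID
        elif eq and name in ("attrib", "missing"):
            options[name].extend(values.split(","))
        else:
            options["other"].append(keyword)             # left uninterpreted
    return filename, fields, options


for text in ("id1.txt A B",
             "id3.txt A C Sex Source : attrib=Sex,Source",
             "collab3.txt   SITE FID IID DATE SEX  : attrib=DATE,SEX missing=NA"):
    print(parse_dict_line(text))
```

Splitting on the first colon assumes the filename itself contains no colon, which is a simplification of this sketch.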
The LOG file now reads
     ID helper, with dictionary [ e.dict ]
     Read 5 unique fields
        Attribute fields: Sex Source
     Reading [ id1.txt ] with fields : A, B
     Reading [ id2.txt ] with fields : B, C
     Reading [ id3.txt ] with fields : A, C, Sex, Source
noting that Sex and Source are attributes. The
output file plink.id now reads
      A       B       C     Sex    Source
     a1      b1      c1       M     Wave1
     a2      b2      c2       M     Wave2
     a3      b3      c3       F     Wave2
     a4      b4      c4       F     Wave1
Note that the columns are sorted in alphabetical order. Also note that
we now see the fourth individual's value for the C field
in this third file (c4) and so it is no longer missing.

Aliases

PLINK supports the use of aliases, where variant forms of an ID value
are understood to map to the same individual. For example, an
individual sample might have been sent for genotyping twice and
received two distinct IDs that we really want to treat as referring
to the same person.
Aliases can be specified in two ways: either by listing the same ID
field twice (or more) in a file, or by entering a comma-delimited list
of terms as a single value. For example, if the dictionary line is
     a.txt C C
and the file a.txt is
     c1 .
     c2 .
     c3 c3_w2
then no aliases are specified for the first two individuals (as
there is a missing value in the second field). For the third
individual, any instance of c3_w2 in the
C field is treated as an alias for c3.
Equivalently, the original id2.txt could simply be modified as follows:
     b2 c2
     b1 c1
     b3 c3,c3_w2
i.e. a comma-delimited list of two or more values indicates the
additional values are aliases for the original value.  Note that
aliases must always be unique. The first value encountered is always the
preferred value, to which aliases are converted.
For example, if the file id3.txt was in fact
     a1 c1    M Wave1
     a2 c2    M Wave2
     a3 c3_w2 F Wave2
     a4 c4    F Wave1
but the appropriate alias for c3 had been specified in one of
the two ways mentioned above, PLINK should run correctly,
automatically converting c3_w2 to c3 and producing
the output file plink.id
     A       B       C       Sex     Source
     a1      b1      c1      M       Wave1
     a2      b2      c2      M       Wave2
     a3      b3      c3      F       Wave2
     a4      b4      c4      F       Wave1
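The alias handling described above can be sketched in a few lines of Python. The helpers `build_alias_map` and `resolve` are hypothetical names, not PLINK functions; the sketch covers only the comma-delimited form and treats the first value in each list as the preferred one.

```python
# Illustrative sketch only; these helpers are hypothetical, not PLINK code.

def build_alias_map(values):
    """Map each alias to its preferred value, given raw entries like 'c3,c3_w2'."""
    preferred = {}
    for raw in values:
        first, *aliases = raw.split(",")   # first value is the preferred one
        for alias in aliases:
            # Aliases must be unique: one alias cannot point at two IDs.
            if preferred.get(alias, first) != first:
                raise ValueError(f"alias {alias} maps to more than one ID")
            preferred[alias] = first
    return preferred


def resolve(value, preferred):
    """Return the preferred form of value, or value itself if it is not an alias."""
    return preferred.get(value, value)


alias_map = build_alias_map(["c2", "c1", "c3,c3_w2"])   # C column of id2.txt
print([resolve(v, alias_map) for v in ["c1", "c2", "c3_w2", "c4"]])
```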
Finally, the command --id-alias generates a file
plink.id.eq that lists all aliases found in the database together
with their preferred values, e.g. (other aliases listed here
just for illustration)
     FIELD     PREF      EQUIV
         C       c3      c3_w2
         C       c3         C3
         A       a1      ID-a1

Joint ID specification

An individual can be uniquely specified by a combination of two or
more IDs instead of a single ID, for example, by a family ID and
individual ID, or a project ID and an individual ID.  This is
represented in the dictionary as follows: 
     id1.list  PROJ  FID  IID  : joint=FID,IID
Note, if a joint ID is specified, then all joint IDs must appear in
subsequent files, e.g. a dictionary file that read as follows:
     id1.list  PROJ     FID   IID  : joint=FID,IID
     id2.list  CLIN_ID  IID 
would give an error
     ERROR: Need to specify all joint fields in dictionary, [id2.list ]
A correct dictionary would read: (note, the order of the fields within the file is not important)
     id1.list  PROJ     FID   IID  : joint=FID,IID
     id2.list  CLIN_ID  IID   FID
This means that different individuals can share the same FID.
For example, the rows
     FID     IID
     F0001   1
     F0001   2
     F0002   1
     F0002   2
now denote four unique individuals.
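As a small illustration of why the joint key matters, the Python sketch below checks uniqueness for a single field and for the FID,IID combination. The `is_unique` helper is a hypothetical name introduced here, not something PLINK provides.

```python
# Illustrative sketch only; `is_unique` is a hypothetical helper.

def is_unique(rows, columns):
    """True if the combination of the given columns differs for every row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)


people = [
    {"FID": "F0001", "IID": "1"},
    {"FID": "F0001", "IID": "2"},
    {"FID": "F0002", "IID": "1"},
    {"FID": "F0002", "IID": "2"},
]

print(is_unique(people, ["IID"]))          # IID alone repeats across families
print(is_unique(people, ["FID", "IID"]))   # the joint key separates all four rows
```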
NOTE You can create joint IDs containing more than
two fields, e.g. joint=X,Y,Z. The order of the joint fields
does not need to be the same in all files. Also, you only need to
specify the "joint=X,Y,.." command once in the dictionary. Finally,
you can also have multiple joint fields: 
     id1.list  SITE   PROJ   FID        IID  : joint=FID,IID joint=SITE,PROJ
     id2.list  FID    IID    CLIN_ID
     id3.list  SITE   PROJ   RECRUIT_ID
HINT The set:field=value command, described 
below, can be used to create joint IDs. This can be useful to ensure 
no accidental overlap of ID schemes between files from different sources. 
See below for an example.

Filtering / lookup options

It is possible to restrict the output to certain rows or columns of
the total database.  For example, to only output fields C
and Sex, add the command
     --id-table C,Sex
To look up all fields for a particular individual, e.g. one with a given ID value under the B ID scheme, use the command
     --id-lookup B=b2
This prints a message to the LOG indicating that a lookup is being
performed
     Lookup up items matching:
       B = b2  (id)
and the output file now only contains a single row
      A      B      C   Sex    Source
     a2     b2     c2     M     Wave2
It is possible to look up an individual based on an alias, e.g. in the
example above,
     --id-lookup C=c3_w2
produces the output in the LOG
     Lookup up items matching:
       C = c3  (id)
indicating that the alias in the query has been replaced with the preferred value, and the output is
      A      B      C   Sex    Source
     a3     b3     c3     F     Wave2
Lookups can also be based on attributes and involve multiple
fields, in which case the row must match all the specified field 
values:
        --id-lookup Sex=M,Source=Wave2
which prints the following to the LOG
     Looking up items matching: 
       Sex = M  (attribute)
       Source = Wave2  (attribute)
     Writing output to [ plink.id ]
     1 unique records retrieved
and the output in plink.id is
      A      B      C   Sex    Source
     a2     b2     c2     M     Wave2
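The all-fields-must-match rule can be sketched in Python as a simple filter. This is only an illustration under stated assumptions: the `lookup` helper is hypothetical, the collated table is written out by hand as a list of dictionaries, and alias conversion is left out.

```python
# Illustrative sketch only; `lookup` is a hypothetical helper, not PLINK code.

def lookup(records, query):
    """Return records matching every field=value term in a query such as
    'Sex=M,Source=Wave2'."""
    wanted = dict(term.split("=", 1) for term in query.split(","))
    return [r for r in records
            if all(r.get(field) == value for field, value in wanted.items())]


records = [
    {"A": "a1", "B": "b1", "C": "c1", "Sex": "M", "Source": "Wave1"},
    {"A": "a2", "B": "b2", "C": "c2", "Sex": "M", "Source": "Wave2"},
    {"A": "a3", "B": "b3", "C": "c3", "Sex": "F", "Source": "Wave2"},
    {"A": "a4", "B": "b4", "C": "c4", "Sex": "F", "Source": "Wave1"},
]

for hit in lookup(records, "Sex=M,Source=Wave2"):
    print(hit)
```

Matching is by exact string equality, which mirrors the absence of ranges and wildcards.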
NOTE It is not currently possible to specify
ranges of numerical values (e.g. less than 10) or wildcards,
(e.g. Wave*) when performing --id-lookup.

Replace ID schemes in external files

The --id-replace command takes three fixed arguments, possibly followed by additional options:
     --id-replace  file  old-ID  new-ID  {options}
It uses the information specified in the dictionary to read in an
external file (i.e. one not specified in the dictionary) and replace or
update the IDs as requested. Consider the data file mydata.dat:
      A   v1 v2 v3 v4   v5
     a1    0  0  1  1 0.23
     a3    1  1  0  1 0.35
     a5    0  0  0  1 0.54
Then the command
 plink --id-dict ex.dict --id-replace mydata.dat A C header
will look up the value for A in mydata.dat, using the fact
that this file has a header row, and replace it, if possible, with the value
for C for that person. This prints the following in the LOG:
    Replacing A with C from [ mydata.dat ]
    Writing new file to [ plink.rep ]
    Set to keep original value for unmatched observations
    Could not find matches for 1 lines
The file plink.rep contains the updated file:
    C  v1  v2  v3  v4  v5
    c1  0  0  1  1  0.23
    c3  1  1  0  1  0.35
    a5  0  0  0  1  0.54
The last line did not match any entry in the database (a5)
and so, by default, it is left as is. Otherwise, the appropriate C IDs
have been swapped in for the other two individuals, and the header
has been changed.
To change the default behavior when a non-matching individual is
encountered, use one of the following
options: warn, skip, miss or list. For example,
 plink --id-dict ex.dict --id-replace mydata.dat A C header warn
will produce an error in the LOG file
     ERROR: Could not find replacement for a5
and not proceed any further. The option 
 plink --id-dict ex.dict --id-replace mydata.dat A C header skip
will simply ignore that line, not printing it in plink.rep which 
will now read
    C  v1  v2  v3  v4  v5
    c1  0  0  1  1  0.23
    c3  1  1  0  1  0.35
The option 
 plink --id-dict ex.dict --id-replace mydata.dat A C header miss
will replace the non-matching ID with the missing code NA,
    C  v1  v2  v3  v4  v5
    c1  0  0  1  1  0.23
    c3  1  1  0  1  0.35
    NA  0  0  0  1  0.54
Finally, the option 
plink --id-dict ex.dict --id-replace mydata.dat A C header list 
will list in plink.rep any individual that did not match: in this 
case, it will just list
     a5
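The default behavior and the four options for unmatched IDs can be summarised in a Python sketch. It is not PLINK's code: `replace_ids` is a hypothetical helper, the mode name "keep" is a label invented here for the default behavior, the A-to-C mapping is written out by hand, and header handling is omitted.

```python
# Illustrative sketch only; `replace_ids` is a hypothetical helper, not PLINK code.

def replace_ids(rows, mapping, column=0, mode="keep", missing="NA"):
    """Swap the ID in `column` of each row using `mapping` (old -> new).

    mode controls rows whose ID has no replacement:
      keep - leave the original value (the default behavior)
      warn - stop with an error
      skip - drop the row
      miss - write the missing code instead of the ID
      list - return only the unmatched IDs
    """
    if mode not in {"keep", "warn", "skip", "miss", "list"}:
        raise ValueError(f"unknown mode: {mode}")
    replaced, unmatched = [], []
    for row in rows:
        old = row[column]
        if old in mapping:
            new = mapping[old]
        elif mode == "warn":
            raise LookupError(f"Could not find replacement for {old}")
        elif mode == "keep":
            new = old
        elif mode == "miss":
            new = missing
        else:  # skip or list: the row itself is not written
            unmatched.append(old)
            continue
        replaced.append(row[:column] + [new] + row[column + 1:])
    return [[u] for u in unmatched] if mode == "list" else replaced


a_to_c = {"a1": "c1", "a2": "c2", "a3": "c3", "a4": "c4"}
data = [["a1", "0", "0", "1", "1", "0.23"],
        ["a3", "1", "1", "0", "1", "0.35"],
        ["a5", "0", "0", "0", "1", "0.54"]]

for mode in ("keep", "skip", "miss", "list"):
    print(mode, replace_ids(data, a_to_c, mode=mode))
```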
It is possible to combine both aliases (in the target file) and joint
IDs (as both the target and replacement ID) with
the --id-replace function. This is specified by use of the plus "+"
symbol, e.g.
plink --id-dict ex.dict --id-replace mydata2.dat GENOID FID+IID  header
will replace the single entry of GENOID with the two values for FID 
and IID. 
Finally, if the file does not contain a header row, use the field option:
plink --id-dict ex.dict --id-replace mydata.dat A C  field=1
which tells PLINK that column 1 of mydata.dat contains the A field. If the target ID
is a joint ID, the same notation can be used in this case:
plink --id-dict ex.dict --id-replace mydata2.dat FID+IID GENOID field=2+5
for example, to indicate that FID is in column 2 and IID is in column 5.
In this case, column 5 will be printed as blank, and so effectively skipped. When the
replacing ID is a joint ID, all joint values replace the first matched field, i.e. in this
case they would have been inserted as columns 2, 3, etc., if the replacement field was in fact
a joint ID rather than just GENOID.

Match multiple files based on IDs

This option takes an index file and one or more other files and sorts
these files to match the order of the index file (inserting blank rows
if needed, or dropping rows if they are not present in the index file,
as specified), using IDs as defined in the dictionary, in the format
     --id-match {file} {ID}  {file} {ID}  {file} {ID} ...  { + options }
with one {file} {ID} pair for each file to be matched. For example,
plink --id-dict ex.dict --id-match dat1.dat A,1 dat2.txt C dat3.txt C
would generate a new file
     plink.match
that lines up the rows in dat2.txt and dat3.txt
to match dat1.dat, using the ID database specified
by ex.dict. The IDs are specified as follows:
     A       Field A, assume header exists and contains A
     A,2     Field A, 2nd column of file, assume no header
     A+B     Joint ID A and B, assume header exists
     A+B,2+3 Joint ID A and B, in 2nd and 3rd columns, no header
Therefore, the above implies that dat1.dat does not contain a
header row, but the other two files do. That is, by specifying a
number following a comma, we implicitly tell PLINK both that no header
exists, and which column to look in. Otherwise we assume the header
should contain the named field (an error will be reported
otherwise). In all cases the files to be matched must be rectangular,
i.e. having the same number of whitespace-delimited fields.
To print only the rows that are present in all files, add the option
complete as follows:
   --id-match f1.txt ID f2.txt ID + complete
Otherwise by default, missing values are printed when the data are not 
present in one of the files. 
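The row alignment itself can be pictured with the following Python sketch. It makes several assumptions that are not taken from PLINK: the `match_rows` helper is hypothetical, the key is taken to be the first column of every table, all keys are assumed to be in one ID scheme already (the dictionary translation is left out), the filler value "." is a guess, and the small tables are invented data. The actual layout of plink.match is not reproduced.

```python
# Illustrative sketch only; `match_rows` is a hypothetical helper, not PLINK code.

def match_rows(index_rows, other_tables, complete=False, filler="."):
    """Align other tables to the row order of the index table.

    Every table is a list of rows whose first column is the key. Each output
    row is the index row followed by the non-key columns of each other table.
    """
    lookups = [{row[0]: row[1:] for row in table} for table in other_tables]
    widths = [len(table[0]) - 1 if table else 0 for table in other_tables]
    merged = []
    for row in index_rows:
        key = row[0]
        if complete and not all(key in lk for lk in lookups):
            continue                      # keep only rows present in all files
        out = list(row)
        for lk, width in zip(lookups, widths):
            out.extend(lk.get(key, [filler] * width))
        merged.append(out)
    return merged


index = [["id1", "x"], ["id2", "y"], ["id3", "z"]]    # invented example data
second = [["id3", "30"], ["id1", "10"], ["id9", "90"]]

print(match_rows(index, [second]))
print(match_rows(index, [second], complete=True))
```

Rows of the second table whose key is absent from the index (id9 here) are dropped, and the output always follows the index order.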
NOTE Any individuals not found in the database
are listed in a file named plink.noid and a message is printed
in the LOG file.

Quick match multiple files based on IDs, without a dictionary

If the --id-match command is used without specifying a data
dictionary, i.e. there is no --id-dict, then we assume a
simple correspondence of ID schemes between files. This can provide a
quick way to join up rectangular text files based on a common key, e.g. 
 ./plink --id-match f1.txt ID  f2.txt ID,2  f3.txt IID
Note: when a field position is specified, it does not matter what the field 
is named (as there is no database to look it up in, in any case). Similarly, 
the ID field may have a different name in some files, e.g. IID not
ID in f3.txt. Importantly, however, we assume the specific
entries in these files all come from the same ID scheme; otherwise a
dictionary should be specified to map between schemes.

Miscellaneous

The dictionary file can specify whether the file has a header row by
adding the keyword header in the dictionary. The missing=
keyword can also be used to specify one or more missing value codes, that 
are specific to that file.
   id1.list A B : header
   id2.list B C 
   id3.list C D : attrib=D header missing=NA,-9

The set command

For an attribute, or part of a joint ID, it is possible to use
the set command to specify that all individuals in that file
have a particular ID value inserted. This can be useful, for example,
if samples from several sources are being grouped, and one wants to
ensure no accidental overlap between samples: e.g.  if one site sends
a file site1.txt with individuals
   ID
    1
    2
    3 
    4
and another site sends a similar file, site2.txt, that refers to three different individuals
     1
     2 
     3
the dictionary ex2.dict could read
     site1.txt ID : set:SITE=1 joint=ID,SITE header
     site2.txt ID : set:SITE=2
then
plink --id-dict ex2.dict
will produce a file plink.id that reads
  ID  SITE
   1     1
   2     1
   3     1
   4     1
   1     2
   2     2
   3     2
Note the specific format, with a colon and equals sign but no spaces: 
     set:field=value
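A rough Python sketch of the effect of set: each file's rows receive the fixed value, after which the pair (ID, SITE) distinguishes individuals that ID alone cannot. The helpers `parse_set` and `apply_set` are hypothetical names for illustration only, and the ID lists are typed in rather than read from site1.txt and site2.txt.

```python
# Illustrative sketch only; these helpers are hypothetical, not PLINK code.

def parse_set(keyword):
    """Split a keyword of the form set:field=value into (field, value)."""
    prefix, _, rest = keyword.partition(":")
    field, eq, value = rest.partition("=")
    if prefix != "set" or not eq or not field or not value:
        raise ValueError(f"expected set:field=value, got {keyword!r}")
    return field, value


def apply_set(ids, keyword):
    """Attach the fixed field value to every ID from one file."""
    field, value = parse_set(keyword)
    return [{"ID": i, field: value} for i in ids]


combined = (apply_set(["1", "2", "3", "4"], "set:SITE=1")
            + apply_set(["1", "2", "3"], "set:SITE=2"))
joint_keys = {(r["ID"], r["SITE"]) for r in combined}

print(len({r["ID"] for r in combined}), "distinct ID values")
print(len(joint_keys), "distinct (ID, SITE) pairs")
```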

List all instances of an ID across files

To get a list of all instances of an ID value across multiple files, use the --id-dump command. For example,
plink --id-dict ex.dict --id-dump A=a1
will list to the LOG file
     Reporting rows that match [ A=a1 ]
     id1.txt : A = a1
     id1.txt : B = b1
     id3.txt : A = a1
     id3.txt : C = c1
     id3.txt : Sex = M
     id3.txt : Source = Wave1
This can be useful in tracking down where incorrect IDs are located across multiple files, for example, in order
to manually resolve inconsistencies, etc.