ID helper
PLINK includes a set of utility options designed to help manage
ID-related project data. In large projects, ID schemes can be
difficult to manage. This set of options is aimed at scenarios in
which individuals have been assigned multiple IDs, meaning that
multiple lookup tables are needed to translate between schemes,
although more basic tasks (e.g. joining multiple files based on a
single shared ID) are supported. In particular, these options will:
- Combine multiple (partially overlapping) ID schemes
- Spot inconsistencies
- Track other (non-unique) attributes along with identifier information
- Filter subsets of this database, for quick look-ups
- Allow ID aliases
- Allow individuals to be uniquely specified by two or more IDs, such as family ID and individual ID
- Automatically collate and update ID schemes in external files
- Merge multiple files based on multiple ID schemes
These functions are generic, in that they are not tied to any
particular format or scheme of IDs used by PLINK. In fact, the
"individuals" need not be samples, but could be anything, e.g. SNPs
with RS numbers and vendor-specific codes. These options are
specifically aimed for cases where ID data, along with limited amounts
of secondary attributes (e.g. sex, age, etc; or chromosome, map
position in the case of SNPs, etc) are stored in flat, rectangular
text files.
Obviously there are many other ways to perform such tasks, for
example, using any standard relational database, a perl script or
Excel. Depending on your needs, you may or may not find the options
implemented here quicker, easier or more reliable than some of these
alternatives.
Example of usage
As an example: consider the following case, in which ID information is
spread across four files: family and individual IDs from two
sites, collab12.txt (site ID, family ID and individual ID)
1 F00001 1
1 F00001 2
1 F00001 3
1 F00002 1
1 F00002 2
1 F00002 3
2 C101 P1
2 C101 M2
2 C101 C2
Similar information from a third site, collab3.txt, but with some additional information appended:
3 1 F00001_1 3/12/09 F
3 1 F00001_2 NA NA
3 1 F00001_3 3/17/09 M
Then we have a report back from the genotyping lab, geno.txt, on some of the
samples (which also includes some other samples):
SITE FID IID GENO PASS
1 F00001 1 S001 Y
1 F00001 2 S002 Y
1 F00001 3 S003 Y
1 F00002 2 S004 N
1 F00002 3 S005 Y
2 C101 P1 S006 Y
2 C101 M2 S007 N
2 C101 C2 S008 Y
2 X1 X1 S009 Y
Finally, we also have information, in followup.txt, on yet a further set of IDs assigned
in a follow-up stage of the project, which are tied to the IDs assigned
by the genotyping lab rather than the original collaborator IDs:
S001 fu_01_a
S002 fu_01_b
S003 fu_01_c
S005 fu_01_d
S006 fu_01_e
S008 fu_01_f
S009 fu_01_g
As described below, the following dictionary file
(proj1.dict) is specified to track this information:
collab12.txt SITE FID IID : joint=SITE,FID,IID
collab3.txt SITE FID IID DATE SEX : attrib=DATE,SEX missing=NA
geno.txt SITE FID IID GENO PASS : attrib=PASS header
followup.txt GENO FUID
and the command
plink --id-dict proj1.dict
will collate all the files (after checking for inconsistencies, etc) into
a single table, with missing values inserted where appropriate:
DATE FID FUID GENO IID PASS SEX SITE
. F00001 fu_01_a S001 1 Y . 1
. F00001 fu_01_b S002 2 Y . 1
. F00001 fu_01_c S003 3 Y . 1
. F00002 . . 1 . . 1
. F00002 . S004 2 N . 1
. F00002 fu_01_d S005 3 Y . 1
. C101 fu_01_e S006 P1 Y . 2
. C101 . S007 M2 N . 2
. C101 fu_01_f S008 C2 Y . 2
3/12/09 1 . . F00001_1 . F 3
. 1 . . F00001_2 . . 3
3/17/09 1 . . F00001_3 . M 3
. X1 fu_01_g S009 X1 Y . 2
There are then numerous commands that can search this database, and
update or match external files based on any of the ID schemes. There
is also a command for joining two or more files based on a single ID
scheme, which does not require a dictionary/database to be
specified. This could be of use, for example, to quickly line up
partially overlapping output from PLINK, based on SNP RS numbers, for
example.
Overview
The idea is that all data are kept in simple plain text files, and
that the complete "master file" is then generated on-the-fly. This
makes it easier to add and edit individual components of the ID
database (i.e. the individual files).
Note: In contrast to a full database, there is no
support for hierarchical, relational data structures. That is, all
observations in all tables must be of the same fundamental unit
(e.g. a single individual).
Suppose we have three sets of IDs, labelled A, B and C, on up to four
individuals. These are described across two files, id1.txt, which
lists the A and B schemes (coded here for clarity to simply be a1, a2, etc)
a1 b1
a2 b2
a3 b3
a4 b4
and id2.txt, which contains the B and C codes for 3 individuals:
b2 c2
b1 c1
b3 c3
For example, the individual labelled a1 under the A scheme is
called b1 under the B scheme. Note that in id2.txt
the individuals are in a different order and one individual (a4/b4)
does not appear in the second file.
Importantly, all ID values and files should conform to the following:
- values are delimited by 1 or more whitespace characters (tab or space)
- one observation/individual per row/line; each line must have same number of fields
- values cannot contain spaces, tabs, commas (,) or plus (+) characters
- missing values must be explicitly indicated (by "." or another specified code, see below)
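The formatting rules above can be checked before a file is added to a dictionary. The following is a minimal Python sketch, not part of PLINK: the function name check_id_table and its messages are hypothetical. It checks only that the file is rectangular and that no value contains a plus character. Commas are deliberately not flagged here, on the assumption that they may appear as alias separators (see the Aliases section).

```python
def check_id_table(lines, missing_codes=(".",)):
    """Return a list of problems found in whitespace-delimited ID rows."""
    problems = []
    width = None  # number of fields on the first non-empty line
    for n, line in enumerate(lines, start=1):
        fields = line.split()  # splits on runs of spaces or tabs
        if not fields:
            problems.append(f"line {n}: empty line")
            continue
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            problems.append(
                f"line {n}: expected {width} fields, found {len(fields)}"
            )
        for value in fields:
            if value in missing_codes:
                continue  # explicit missing code, nothing to check
            if "+" in value:
                problems.append(f"line {n}: '+' not allowed in value {value!r}")
    return problems
```

A row with an omitted value shows up as a field-count mismatch, which is why missing values have to be written explicitly.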
A dictionary file describing these ID tables would be as
follows, e.g. in the file example.dict
id1.txt A B
id2.txt B C
The dictionary file lists each file in the database, followed by the
field names in each. This dictionary thereby specifies that the second
field in id1.txt should correspond with the first field
in id2.txt as they both represent the B ID scheme. The dictionary file
can also contain other commands, described below. Dictionaries can include full paths
(i.e. database files can reside in different directories).
The basic command
plink --id-dict example.dict
will load all the ID data, check for consistency and generate the
following in the LOG file
ID helper, with dictionary [ example.dict ]
Read 3 unique fields
Reading [ id1.txt ] with fields : A, B
Reading [ id2.txt ] with fields : B, C
Writing output to [ plink.id ]
4 unique records retrieved
The default behavior is to generate a file
plink.id
that contains all the fields, with a header row included:
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 .
Because the last individual wasn't listed for the C field, a missing
character (period/full stop ".") is entered.
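To make the collation step concrete, here is a minimal Python sketch of the idea, assuming each file has already been read into a list of rows. It is an illustration only, not PLINK's implementation: the function name collate is hypothetical, it merges a row into the first record that shares any field value, and it performs none of the consistency checks described in the next section.

```python
def collate(tables, missing="."):
    """tables: list of (field_names, rows). Returns (header, merged_rows)."""
    records = []  # each record is a dict mapping field name -> value
    for fields, rows in tables:
        for row in rows:
            new = {f: v for f, v in zip(fields, row) if v != missing}
            # Find the first existing record that shares any field value.
            match = next(
                (r for r in records
                 if any(r.get(f) == v for f, v in new.items())),
                None,
            )
            if match is None:
                records.append(new)
            else:
                match.update(new)
    # Columns are written in alphabetical order of field name.
    header = sorted({f for fields, _ in tables for f in fields})
    return header, [[r.get(f, missing) for f in header] for r in records]


id1 = (["A", "B"], [["a1", "b1"], ["a2", "b2"], ["a3", "b3"], ["a4", "b4"]])
id2 = (["B", "C"], [["b2", "c2"], ["b1", "c1"], ["b3", "c3"]])
header, rows = collate([id1, id2])
```

With the two example files, rows holds the same four records as plink.id above, with "." for the C value of the fourth individual.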
Consistency checks
Imagine that one of the IDs had been entered incorrectly, for example
if id2.txt has c2 repeated:
b2 c2
b1 c1
b3 c2
PLINK would report this problem when loading the file, pointing out the inconsistency:
*** Problems were detected in the ID lists:
Two unique entries [ B = b2 and b3 ] that match elsewhere
a) A=a2 B=b2 C=c2
b) B=b3 C=c2
That is, PLINK has spotted that two entries are matched for the C
field, but have different values for the B field. As these values are
assumed to be unique identifiers, this is an inconsistency that must
be fixed by the user. Inconsistencies across files or involving more
than 2 ID fields can also be spotted.
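The kind of check described here can be sketched in a few lines of Python. This is a simplified illustration under the assumption that every field is a unique identifier; the function name find_conflicts and the tuple it returns are hypothetical, and PLINK's own checks may differ.

```python
def find_conflicts(rows, fields):
    """Report ID values that co-occur with two different values of another ID.

    rows: list of dicts (field name -> value), one per line of any file.
    Returns tuples (field, value, other_field, first_value, second_value).
    """
    seen = {}  # (field, value, other_field) -> first other-field value seen
    conflicts = []
    for row in rows:
        for f in fields:
            for g in fields:
                if f == g or f not in row or g not in row:
                    continue
                key = (f, row[f], g)
                if key in seen and seen[key] != row[g]:
                    conflicts.append((f, row[f], g, seen[key], row[g]))
                seen.setdefault(key, row[g])
    return conflicts


rows = [
    {"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"},
    {"A": "a3", "B": "b3"}, {"A": "a4", "B": "b4"},
    {"B": "b2", "C": "c2"}, {"B": "b1", "C": "c1"},
    {"B": "b3", "C": "c2"},  # c2 entered twice by mistake
]
conflicts = find_conflicts(rows, ["A", "B", "C"])
```

Here conflicts contains a single entry saying that C = c2 is paired with both B = b2 and B = b3, which is the inconsistency reported in the log above.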
Attributes
In the example above, consider that id2.txt has been fixed,
but that we now have a third file, id3.txt:
a1 c1 M Wave1
a2 c2 M Wave2
a3 c3 F Wave2
a4 c4 F Wave1
The third and fourth fields have non-unique values (e.g. M, for male,
is repeated). In this example, this is because they contain
information (attributes) that we want to track along with the sample
IDs, but which is not an ID itself, i.e. the sex and source of the
sample. It is possible to indicate that certain fields are to be
treated not as identifiers (that, by definition, should be
unique for each individual) but instead as attributes.
The dictionary now reads:
id1.txt A B
id2.txt B C
id3.txt A C Sex Source : attrib=Sex,Source
using the attrib= keyword after a colon (:) character
to specify that the fields Sex and Source are
attributes, not identifiers. This effectively means that duplicates
are allowed, and that these values will not be considered when attempting
to reconcile individuals across files.
Note: All dictionary commands follow the filename and
field headings; a colon character must come before any keyword; all
items must be on the same line.
The LOG file now reads
ID helper, with dictionary [ e.dict ]
Read 5 unique fields
Attribute fields: Sex Source
Reading [ id1.txt ] with fields : A, B
Reading [ id2.txt ] with fields : B, C
Reading [ id3.txt ] with fields : A, C, Sex, Source
noting that Sex and Source are attributes. The
output file plink.id now reads
A B C Sex Source
a1 b1 c1 M Wave1
a2 b2 c2 M Wave2
a3 b3 c3 F Wave2
a4 b4 c4 F Wave1
Note that the columns are sorted in alphabetical order. Also note that
we now see the fourth individual's value for the C field
in this third file (c4) and so it is no longer missing.
Aliases
PLINK supports the use of aliases, where variant forms of an ID value
are understood to map to the same individual. For example, an
individual sample might have been sent for genotyping twice and
received two distinct IDs that we really want to treat as referring
to the same person.
Aliases can be specified in two ways: either by listing the same ID
field twice (or more) in a file, or by entering a comma-delimited list
of terms as a single value. For example, suppose the dictionary line is
a.txt C C
and the file a.txt is
c1 .
c2 .
c3 c3_w2
For the first two individuals, there are no aliases specified (as
there is a missing value for the second field). For the third
individual, this indicates that any instance of c3_w2 for the
C field should be treated as an alias for c3.
Equivalently, the original id2.txt could simply be modified as follows:
b2 c2
b1 c1
b3 c3,c3_w2
i.e. a comma-delimited list of two or more values indicates the
additional values are aliases for the original value. Note that
aliases must always be unique. The first value encountered is always the
preferred value, to which aliases are converted.
For example, if the file id3.txt was in fact,
a1 c1 M Wave1
a2 c2 M Wave2
a3 c3_w2 F Wave2
a4 c4 F Wave1
but the appropriate alias for c3 had been specified in one of
the two ways mentioned above, PLINK should run correctly,
automatically converting c3_w2 to c3 and producing
the output file plink.id
A B C Sex Source
a1 b1 c1 M Wave1
a2 b2 c2 M Wave2
a3 b3 c3 F Wave2
a4 b4 c4 F Wave1
Finally, the command --id-alias generates a file
plink.id.eq that lists all aliases and the preferred value
that are found in the database: e.g. (other aliases listed here
just for illustration)
FIELD PREF EQUIV
C c3 c3_w2
C c3 C3
A a1 ID-a1
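The comma-delimited alias form can be illustrated with a short Python sketch. It assumes, as stated above, that the first value in a list is the preferred one and that an alias may point to only one preferred value. The function names build_alias_map and canonical are hypothetical and are not PLINK options.

```python
def build_alias_map(values):
    """values: raw entries for one field, e.g. 'c3,c3_w2'.

    Returns a dict mapping each alias to its preferred value.
    """
    preferred = {}
    for raw in values:
        first, *others = raw.split(",")
        for alias in others:
            if alias in preferred and preferred[alias] != first:
                raise ValueError(f"alias {alias!r} maps to two different values")
            preferred[alias] = first
    return preferred


def canonical(value, alias_map):
    """Convert an alias to its preferred value; other values pass through."""
    return alias_map.get(value, value)


# The C column of the modified id2.txt shown above.
alias_map = build_alias_map(["c2", "c1", "c3,c3_w2"])
```

Applying canonical to every C value as a file is read is what allows a row carrying c3_w2 to be matched with the individual known as c3.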
Joint ID specification
An individual can be uniquely specified by a combination of two or
more IDs instead of a single ID, for example, by a family ID and
individual ID, or a project ID and an individual ID. This is
represented in the dictionary as follows:
id1.list PROJ FID IID : joint=FID,IID
Note that if a joint ID is specified, then all fields of that joint ID must
appear together in subsequent files, e.g. a dictionary file that reads as follows:
id1.list PROJ FID IID : joint=FID,IID
id2.list CLIN_ID IID
would give an error
ERROR: Need to specify all joint fields in dictionary, [id2.list ]
A correct dictionary would read: (note, the order of the fields within the file is not important)
id1.list PROJ FID IID : joint=FID,IID
id2.list CLIN_ID IID FID
This means that different individuals can share the same FID.
For example, the rows
FID IID
F0001 1
F0001 2
F0002 1
F0002 2
now denote four unique individuals.
NOTE: You can create joint IDs containing more than
two fields, e.g. joint=X,Y,Z. The order of the joint fields
does not need to be the same in all files. Also, you only need to
specify the "joint=X,Y,.." command once in the dictionary. Finally,
you can also have multiple joint fields:
id1.list SITE PROJ FID IID : joint=FID,IID joint=SITE,PROJ
id2.list FID IID CLIN_ID
id3.list SITE PROJ RECRUIT_ID
HINT: The set:field=value command, described
below, can be used to create joint IDs. This can be useful to ensure
no accidental overlap of ID schemes between files from different sources.
See below for an example.
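One way to think about a joint ID is as a composite key: the individual is identified by the tuple of values, not by any single field. The Python sketch below illustrates this with the FID/IID example above; the helper name joint_key is hypothetical.

```python
def joint_key(row, joint_fields):
    """Build a hashable key from the named fields of one row (a dict)."""
    return tuple(row[f] for f in joint_fields)


rows = [
    {"FID": "F0001", "IID": "1"},
    {"FID": "F0001", "IID": "2"},
    {"IID": "1", "FID": "F0002"},  # field order in the file does not matter
    {"IID": "2", "FID": "F0002"},
]
keys = {joint_key(r, ("FID", "IID")) for r in rows}
family_ids = {r["FID"] for r in rows}
```

There are four distinct keys but only two distinct FID values, so FID alone could not identify these individuals.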
Filtering / lookup options
It is possible to restrict the output to certain rows or columns of
the total database. For example, to only output fields C
and Sex, add the command
--id-table C,Sex
To lookup all fields on a particular individual, e.g. with a given ID value for the B ID scheme, use the command
--id-lookup B=b2
This prints a message to the LOG indicating that a lookup is being
performed
Lookup up items matching:
B = b2 (id)
and the output file now only contains a single row
A B C Sex Source
a2 b2 c2 M Wave2
It is possible to lookup an individual based on an alias, e.g. in the
example above,
--id-lookup C=c3_w2
produces the output in the LOG
Lookup up items matching:
C = c3 (id)
indicating that the query term alias has been replaced with the preferred value, and the output is
A B C Sex Source
a3 b3 c3 F Wave2
Lookups can also be based on attributes and involve multiple
fields, in which case the row must match all the specified field
values. For example,
--id-lookup Sex=M,Source=Wave2
prints the following in the LOG:
Looking up items matching:
Sex = M (attribute)
Source = Wave2 (attribute)
Writing output to [ plink.id ]
1 unique records retrieved
and the output in plink.id is
A B C Sex Source
a2 b2 c2 M Wave2
NOTE: It is not currently possible to specify
ranges of numerical values (e.g. less than 10) or wildcards,
(e.g. Wave*) when performing --id-lookup.
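The matching rule for lookups (every listed field must equal the given value exactly) can be sketched as follows. This is an illustration in Python rather than PLINK's code; the names parse_query and lookup are hypothetical, and alias conversion of the query terms is omitted.

```python
def parse_query(text):
    """'Sex=M,Source=Wave2' -> {'Sex': 'M', 'Source': 'Wave2'}"""
    return dict(term.split("=", 1) for term in text.split(","))


def lookup(rows, query):
    """Return the rows (dicts) that match every field=value pair exactly."""
    return [r for r in rows if all(r.get(f) == v for f, v in query.items())]


fields = ["A", "B", "C", "Sex", "Source"]
table = [
    dict(zip(fields, line.split()))
    for line in [
        "a1 b1 c1 M Wave1",
        "a2 b2 c2 M Wave2",
        "a3 b3 c3 F Wave2",
        "a4 b4 c4 F Wave1",
    ]
]
hits = lookup(table, parse_query("Sex=M,Source=Wave2"))
```

Because the comparison is plain equality, a query such as Source=Wave* would match nothing in this sketch, in line with the note above.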
Replace ID schemes in external files
The --id-replace command takes three fixed arguments, possibly followed by additional options:
--id-replace file old-ID new-ID {options}
It uses the information specified in the dictionary to read in an
external file (i.e. one not specified in the dictionary) and replace or
update the IDs as requested. Consider the data file mydata.dat:
A v1 v2 v3 v4 v5
a1 0 0 1 1 0.23
a3 1 1 0 1 0.35
a5 0 0 0 1 0.54
Then the command
plink --id-dict ex.dict --id-replace mydata.dat A C header
will look up the value for A in mydata.dat, using the fact
that this file has a header row, and replace it, if possible, with the value
for C for that person. This prints the following in the LOG:
Replacing A with C from [ mydata.dat ]
Writing new file to [ plink.rep ]
Set to keep original value for unmatched observations
Could not find matches for 1 lines
The file plink.rep contains the updated file:
C v1 v2 v3 v4 v5
c1 0 0 1 1 0.23
c3 1 1 0 1 0.35
a5 0 0 0 1 0.54
The last line did not match any entry in the database (a5)
and so, by default, it is left as is. Otherwise, the appropriate C IDs
have been swapped in for the other two individuals, and the header
has been changed.
To change the default behavior when a non-matching individual is
encountered, use one of the following
options: warn, skip, miss or list. For example,
plink --id-dict ex.dict --id-replace mydata.dat A C header warn
will produce an error in the LOG file
ERROR: Could not find replacement for a5
and not proceed any further. The option
plink --id-dict ex.dict --id-replace mydata.dat A C header skip
will simply ignore that line, not printing it in plink.rep which
will now read
C v1 v2 v3 v4 v5
c1 0 0 1 1 0.23
c3 1 1 0 1 0.35
The option
plink --id-dict ex.dict --id-replace mydata.dat A C header miss
will replace the non-matching ID with the missing code NA,
C v1 v2 v3 v4 v5
c1 0 0 1 1 0.23
c3 1 1 0 1 0.35
NA 0 0 0 1 0.54
Finally, the option
plink --id-dict ex.dict --id-replace mydata.dat A C header list
will list in plink.rep any individual that did not match: in this
case, it will just list
a5
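The five behaviors for unmatched IDs can be summarized in a short Python sketch. It assumes the old ID is in the first column and that the A-to-C mapping has already been built from the dictionary. The function name replace_ids and the mode name "keep" (for the default) are hypothetical; warn, skip, miss and list mirror the options above.

```python
def replace_ids(rows, mapping, mode="keep", missing="NA"):
    """rows: data rows (lists) whose first column holds the old ID.
    mapping: old ID -> new ID.
    Returns the rows that would be written to the output file."""
    out, unmatched = [], []
    for row in rows:
        old = row[0]
        if old in mapping:
            out.append([mapping[old]] + row[1:])
        elif mode == "warn":
            raise LookupError(f"Could not find replacement for {old}")
        elif mode == "skip":
            continue  # drop the line entirely
        elif mode == "miss":
            out.append([missing] + row[1:])
        elif mode == "list":
            unmatched.append([old])
        else:  # default: keep the original value
            out.append(row)
    return unmatched if mode == "list" else out


data = [
    ["a1", "0", "0", "1", "1", "0.23"],
    ["a3", "1", "1", "0", "1", "0.35"],
    ["a5", "0", "0", "0", "1", "0.54"],
]
a_to_c = {"a1": "c1", "a2": "c2", "a3": "c3", "a4": "c4"}
default_rows = replace_ids(data, a_to_c)
```

Calling replace_ids(data, a_to_c, mode="list") returns only the unmatched ID a5, which corresponds to the last case above.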
It is possible to combine both aliases (in the target file) and joint
IDs (as both the target and replacement ID) with
the --id-replace function. This is specified by use of the plus "+"
symbol, e.g.
plink --id-dict ex.dict --id-replace mydata2.dat GENOID FID+IID header
will replace the single entry of GENOID with the two values for FID
and IID.
Finally, if the file does not contain a header row, use the field option:
plink --id-dict ex.dict --id-replace mydata.dat A C field=1
which tells PLINK that column 1 of mydata.dat contains the A field. If the target ID
is a joint ID, the same notation can be used in this case:
plink --id-dict ex.dict --id-replace mydata2.dat FID+IID GENOID field=2+5
for example, to indicate that FID is in column 2 and IID is in column 5.
In this case, column 5 will be printed as blank, and so effectively skipped. When the
replacing ID is a joint ID, all joint values replace the first matched field, i.e. in this
case they would have been inserted as columns 2, 3, etc., if the replacement field were in fact
a joint ID rather than just GENOID.
Match multiple files based on IDs
This option takes an index file and one or more other files and sorts
these files to match the order of the index file (inserting blank rows
if needed, or dropping rows if they are not present in the index file,
as specified), using IDs as defined in the dictionary, in the format
--id-match {file} {ID} {file} {ID} {file} {ID} ... { + options }
where one {file} {ID} pair is given for each file to be matched. For example,
plink --id-dict ex.dict --id-match dat1.dat A,1 dat2.txt C dat3.txt C
would generate a new file
plink.match
that lines up the rows in dat2.txt and dat3.txt
to match dat1.dat, using the ID database specified
by ex.dict. The IDs are specified as follows:
A Field A, assume header exists and contains A
A,2 Field A, 2nd column of file, assume no header
A+B Joint ID A and B, assume header exists
A+B,2+3 Joint ID A and B, in 2nd and 3rd columns, no header
Therefore, the above implies that dat1.dat does not contain a
header row, but the other two files do. That is, by specifying a
number following a comma, we implicitly tell PLINK both that no header
exists, and which column to look in. Otherwise we assume the header
should contain the named field (an error will be reported
otherwise). In all cases the files to be matched must be rectangular,
i.e. having the same number of whitespace-delimited fields.
To print only the rows that are present in all files, add the option
complete as follows:
--id-match f1.txt ID f2.txt ID + complete
Otherwise by default, missing values are printed when the data are not
present in one of the files.
NOTE: Any individuals not found in the database
are listed in a file named plink.noid, and a message is printed
in the LOG file.
Quick match multiple files based on IDs, without a dictionary
If the --id-match command is used without specifying a data
dictionary, i.e. there is no --id-dict, then we assume a
simple correspondence of ID schemes between files. This can provide a
quick way to join up rectangular text files based on a common key, e.g.
./plink --id-match f1.txt ID f2.txt ID,2 f3.txt IID
Note: when a field position is specified, it does not matter what the field
is named (as there is no database to look it up in, in any case). Similarly,
the ID field may have a different name in some files, e.g. IID not
ID in f3.txt. Importantly, however, we assume the specific
entries in these files all come from the same ID scheme, i.e. otherwise a
dictionary should be specified to map between schemes.
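The dictionary-free join amounts to reordering each file to follow the index file on a shared key. The Python sketch below shows this for one additional file. It is an illustration only: the function name match_to_index is hypothetical, and the use of "." as the filler for absent rows is an assumption, since the text above does not say which missing code is written to plink.match.

```python
def match_to_index(index_ids, other, width, complete=False, missing="."):
    """Line up rows from a second file with the order of the index file.

    index_ids: IDs in index-file order.
    other: dict mapping ID -> list of remaining fields from the second file.
    width: number of non-ID fields in the second file.
    complete: if True, keep only IDs present in both files.
    """
    out = []
    for key in index_ids:
        if key in other:
            out.append([key] + other[key])
        elif not complete:
            out.append([key] + [missing] * width)
    return out  # IDs found only in the second file are dropped


index_ids = ["s1", "s2", "s3"]
second = {"s3": ["x"], "s1": ["y"], "s9": ["z"]}
aligned = match_to_index(index_ids, second, width=1)
both = match_to_index(index_ids, second, width=1, complete=True)
```

The complete flag plays the role of the complete option: without it, s2 is kept with a filler value; with it, s2 is omitted.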
Miscellaneous
The dictionary file can specify whether the file has a header row by
adding the keyword header in the dictionary. The missing=
keyword can also be used to specify one or more missing value codes, that
are specific to that file.
id1.list A B : header
id2.list B C
id3.list C D : attrib=D header missing=NA,-9
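For readers who want to process dictionary files with their own scripts, a line of the form shown above can be split into its parts as follows. This is a minimal Python sketch, not PLINK's parser: the function name parse_dict_line and the layout of the returned dict are hypothetical, and it assumes the colon separator is surrounded by whitespace, as in all the examples in this section. Tokens it does not recognise are kept unchanged under "other".

```python
def parse_dict_line(line):
    """Split 'file F1 F2 : key=value flag ...' into its parts."""
    tokens = line.split()
    if ":" in tokens:
        i = tokens.index(":")
        left, right = tokens[:i], tokens[i + 1:]
    else:
        left, right = tokens, []
    filename, *fields = left
    entry = {"file": filename, "fields": fields, "attrib": [],
             "joint": [], "missing": [], "header": False, "other": []}
    for token in right:
        key, sep, value = token.partition("=")
        if token == "header":
            entry["header"] = True
        elif sep and key == "joint":
            entry["joint"].append(value.split(","))  # one list per joint ID
        elif sep and key in ("attrib", "missing"):
            entry[key].extend(value.split(","))
        else:
            entry["other"].append(token)
    return entry


entry = parse_dict_line("id3.list C D : attrib=D header missing=NA,-9")
```

For this line, entry records the file id3.list, fields C and D, attribute D, a header row, and the two missing codes NA and -9.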
The set command
For an attribute, or part of a joint ID, it is possible to use
the set command to specify that all individuals in that file
have a particular ID value inserted. This can be useful, for example,
if samples from several sources are being grouped, and one wants to
ensure no accidental overlap between samples: e.g. if one site sends
a file site1.txt with individuals
ID
1
2
3
4
and another site sends a similar file, site2.txt, that refers to three different individuals
1
2
3
the dictionary ex2.dict could read
site1.txt ID : set:SITE=1 joint=ID,SITE header
site2.txt ID : set:SITE=2
then
plink --id-dict ex2.dict
will produce a file plink.id that reads
ID SITE
1 1
2 1
3 1
4 1
1 2
2 2
3 2
Note the specific format, with a colon and equals sign but no spaces:
set:field=value
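The effect of set in this example is to add a constant column to every row of a file, so that the joint ID (ID, SITE) separates individuals whose ID values collide. A minimal Python sketch of that effect, with the hypothetical helper name apply_set:

```python
def apply_set(fields, rows, set_field, value):
    """Append a constant column, mimicking set:field=value for one file."""
    return fields + [set_field], [row + [value] for row in rows]


site1 = [["1"], ["2"], ["3"], ["4"]]
site2 = [["1"], ["2"], ["3"]]
fields, rows1 = apply_set(["ID"], site1, "SITE", "1")
_, rows2 = apply_set(["ID"], site2, "SITE", "2")
combined = rows1 + rows2
unique_people = {tuple(r) for r in combined}
```

Without the SITE column the seven rows would collapse to four distinct ID values; with it, all seven remain distinct.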
List all instances of an ID across files
To get a list of all instances of an ID value across multiple files, use the --id-dump command. For example,
plink --id-dict ex.dict --id-dump A=a1
will list the following in the LOG file:
Reporting rows that match [ A=a1 ]
id1.txt : A = a1
id1.txt : B = b1
id3.txt : A = a1
id3.txt : C = c1
id3.txt : Sex = M
id3.txt : Source = Wave1
This can be useful in tracking down where incorrect IDs are located across multiple files, for example, in order
to manually resolve inconsistencies, etc.
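A report of this shape can be reproduced with a short Python sketch, assuming each file has been read into a list of rows. The function name dump is hypothetical. Like the log above, it reports only files that contain the queried field and prints every field of each matching row.

```python
def dump(files, field, value):
    """files: dict mapping file name -> (field_names, rows).

    Returns 'file : field = value' lines for every row where field == value.
    """
    lines = []
    for name, (fields, rows) in files.items():
        if field not in fields:
            continue  # this file does not use the queried ID scheme
        col = fields.index(field)
        for row in rows:
            if row[col] == value:
                lines.extend(f"{name} : {f} = {v}" for f, v in zip(fields, row))
    return lines


files = {
    "id1.txt": (["A", "B"], [["a1", "b1"], ["a2", "b2"]]),
    "id2.txt": (["B", "C"], [["b2", "c2"], ["b1", "c1"]]),
    "id3.txt": (["A", "C", "Sex", "Source"], [["a1", "c1", "M", "Wave1"]]),
}
report = dump(files, "A", "a1")
```

In this sketch id2.txt contributes nothing to the report because it has no A field.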