LD calculations
PLINK includes a set of options to calculate pairwise linkage
disequilibrium (LD) between SNPs, and to present or process this
information in various ways. Also see the functions
on haplotype analysis.
Pairwise LD measures for a single pair of SNPs
The command --ld followed by two SNP identifiers prints the
following LD statistics to the LOG file, for a single pair of SNPs:
r-squared, D', the estimated haplotype frequencies and those expected
under linkage equilibrium. It also indicates which haplotypes are in phase
(i.e. occurring more often than expected by chance). For example:
plink --bfile mydata --ld rs2840528 rs7545940
gives the following output
LD information for SNP pair [ rs2840528 rs7545940 ]
R-sq = 0.592 D' = 0.936
Haplotype Frequency Expectation under LE
--------- --------- --------------------
GC 0.013 0.199
AC 0.435 0.245
GT 0.441 0.250
AT 0.111 0.307
In phase alleles are GT/AC
The LD statistics presented here are based on haplotype frequencies
estimated via the EM algorithm. Only founders are used in these
calculations.
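As a rough illustration of how the reported statistics relate to the haplotype frequencies, the following Python sketch recomputes r-squared and D' from the four estimated frequencies in the output above. It uses the standard textbook definitions (D = P(AB) - P(A)P(B), with D' = |D| / Dmax); the function name and the dictionary layout are assumptions made for this example, and this is not PLINK's own code. Because the printed frequencies are rounded to three decimals, the results only approximately match the values in the LOG file.

```python
def ld_from_haplotypes(freqs):
    """Return (r_squared, d_prime) for two biallelic SNPs.

    freqs maps two-character haplotypes (allele at SNP 1, then allele
    at SNP 2) to their estimated frequencies, which should sum to ~1.
    Both SNPs are assumed to be polymorphic.
    """
    ref1 = sorted({h[0] for h in freqs})[0]   # arbitrary reference allele, SNP 1
    ref2 = sorted({h[1] for h in freqs})[0]   # arbitrary reference allele, SNP 2
    p1 = sum(f for h, f in freqs.items() if h[0] == ref1)
    p2 = sum(f for h, f in freqs.items() if h[1] == ref2)
    d = freqs.get(ref1 + ref2, 0.0) - p1 * p2
    r_squared = d * d / (p1 * (1 - p1) * p2 * (1 - p2))
    if d < 0:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    else:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    return r_squared, abs(d) / d_max


# Estimated haplotype frequencies copied from the example output above.
freqs = {"GC": 0.013, "AC": 0.435, "GT": 0.441, "AT": 0.111}
r_squared, d_prime = ld_from_haplotypes(freqs)
print(f"R-sq = {r_squared:.3f}  D' = {d_prime:.3f}")
```

The choice of reference allele at each SNP does not affect either statistic, since both depend only on the magnitude of D and on the allele frequencies.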
Pairwise LD measures for multiple SNPs (genome-wide)
Correlations based on genotype allele counts (i.e. without phasing, and
for founders only) can be obtained with the commands
plink --file mydata --r
or
plink --file mydata --r2
That is, for each pair of SNPs, this calculates the correlation between two
variables, each coded 0, 1 or 2 to represent the number of non-reference
alleles at that SNP. The squared correlation based on genotypic allele
counts is therefore not identical to the r-sq as estimated from
haplotype frequencies (see above), although it will typically be very
similar. Because it is faster to calculate, it provides a good way to
screen for strong LD. The estimated value for the example in the
section above (rs2840528, rs7545940) is 0.5748 (versus 0.592).
Both commands create a file called
plink.ld
with a list of R or R-squared values in it.
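The genotype-based correlation described above is an ordinary Pearson correlation between two vectors of allele counts. The sketch below shows the calculation in Python on made-up data; the genotype vectors, the function name, and the choice to skip individuals with a missing genotype at either SNP are assumptions of this example and are not taken from PLINK.

```python
from math import sqrt


def genotype_r(x, y):
    """Pearson correlation between two genotype vectors coded 0/1/2.

    Individuals with a missing genotype (None) at either SNP are skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Hypothetical allele counts for six founders at two SNPs.
snp_a = [0, 0, 1, 2, None, 1]
snp_b = [0, 1, 1, 2, 2, 1]
r = genotype_r(snp_a, snp_b)
print(f"r = {r:.4f}, r2 = {r * r:.4f}")
```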
Filtering the output
By default, several filters are imposed on which pairwise calculations
are performed and reported. To only analyse SNPs that are not more
than 10 SNPs apart, for example, use the option (default is 10 SNPs)
--ld-window 10
to additionally specify a kb window (default is 1Mb)
--ld-window-kb 1000
and to report only values above a particular threshold (default is 0.2; this
applies only when the --r2 command, and not the --r command, is used)
--ld-window-r2 0.2
The default for --ld-window-r2 is set at 0.2 to reduce the
size of output files when many comparisons are made: to get all pairs
reported, set --ld-window-r2 to 0.
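The three filters combine as a simple conjunction: a pair is listed only if it passes all of them. The Python sketch below restates that logic for a pair of SNPs on the same chromosome. It is only an illustration of the description above, not PLINK's implementation; the function and argument names are invented here, and the handling of values exactly at a boundary (<= and >=) is an assumption.

```python
def pair_is_reported(idx_a, idx_b, bp_a, bp_b, r2,
                     window_snps=10, window_kb=1000, min_r2=0.2):
    """Return True if a same-chromosome SNP pair passes all three filters.

    idx_a, idx_b: positions of the two SNPs in map order.
    bp_a, bp_b:   base-pair positions of the two SNPs.
    r2:           squared genotypic correlation for the pair.
    Boundary handling (<= and >=) is an assumption of this sketch.
    """
    within_snps = abs(idx_b - idx_a) <= window_snps       # --ld-window
    within_kb = abs(bp_b - bp_a) <= window_kb * 1000      # --ld-window-kb
    strong_enough = r2 >= min_r2                          # --ld-window-r2
    return within_snps and within_kb and strong_enough


# With the defaults, a weakly correlated pair is dropped.
print(pair_is_reported(0, 5, 100_000, 150_000, r2=0.05))
# Setting the r-squared threshold to 0 keeps it.
print(pair_is_reported(0, 5, 100_000, 150_000, r2=0.05, min_r2=0))
```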
Obtaining LD values for a specific SNP versus all others
To obtain all LD values for a set of SNPs versus one specific SNP, use the --ld-snp
command in conjunction with --r2. For example, to get a list of all values for
every SNP within 1Mb of rs12345, use the command
plink --file mydata --r2 --ld-snp rs12345 --ld-window-kb 1000 --ld-window 99999 --ld-window-r2 0
The --ld-window and --ld-window-r2 options effectively mean that output
will be shown for all other SNPs within 1Mb of rs12345.
To obtain all LD values for a group of seed SNPs versus other SNPs
(similar to the --ld-snp command, but for multiple seed SNPs), use the
command
--ld-snp-list mysnps.txt
where mysnps.txt is a list of SNPs.
Obtaining a matrix of LD values
Alternatively, it is possible to add the --matrix option,
which creates a matrix of LD values rather than a list: in this case,
all SNP pairs are calculated and reported, even for SNPs on different
chromosomes.
Note To force all SNP-by-SNP cross-chromosome comparisons
with the standard output format (i.e. without --matrix) add the flag
--inter-chr
instead. This can be combined
with --ld-window-r2, for example to list all
inter-chromosomal SNP pairs with very high R-squared
values. Warning: this command could take an excessively long
time to run if applied to large datasets with many SNPs.
Functions to select tag SNPs for specified SNP sets
The command
plink --bfile mydata --show-tags mysnps.txt
where mysnps.txt is just a list of SNP IDs, generates a file
plink.tags
that lists all the SNPs in the dataset that tag the SNPs
in mysnps.txt (including the SNPs in the original file).
A message is also written to the LOG file that indicates how many new
SNPs were added:
Reading SNPs to tag from [ mysnps.txt ]
Read 10 SNPs to tag, of which 10 are unique and present
In total, added 2 tag SNPs
Writing tag list to [ plink.tags ]
meaning that plink.tags will contain 12 SNPs. This command
could be useful, for example, if one wants to generate a list of SNPs
that tag all known coding SNPs, or a list of known disease-associated
SNPs.
If the option
--list-all
is also added, then an additional file is generated that gives some
more details for each target SNP (i.e. each SNP listed
in mysnps.txt, in the above example) regarding how many and
which tags were set for it. The file is named
plink.tags.list
and has the following fields
SNP Target SNP ID
CHR Chromosome code
BP Physical position (base-pair)
NTAG Number of other SNPs that tag this SNP
LEFT Physical position of left-most (5') tagging SNP (bp)
RIGHT Physical position of right-most (3') tagging SNP (bp)
KBSPAN Kilobase size of region implied by LEFT-RIGHT
TAGS List of SNPs that tag target
For example:
SNP CHR BP NTAG LEFT RIGHT KBSPAN TAGS
rs2542334 22 16694612 2 16693517 16695440 1.923 rs415170|rs2587108
rs2587108 22 16695440 2 16693517 16695440 1.923 rs415170|rs2542334
rs873387 22 16713566 0 16713566 16713566 0 NONE
rs11917 22 16717565 2 16717565 16742194 24.629 rs1057721|rs2075444
rs1057721 22 16718397 2 16717565 16742194 24.629 rs11917|rs2075444
rs9605422 22 16737494 0 16737494 16737494 0 NONE
rs2075444 22 16742194 2 16717565 16742194 24.629 rs11917|rs1057721
rs4819644 22 16744470 0 16744470 16744470 0 NONE
rs2083882 22 16769795 0 16769795 16769795 0 NONE
rs5992907 22 16796453 5 16796453 16830384 33.931 rs400509|rs396012|rs415651|rs384215|rs453557
rs400509 22 16800853 3 16796453 16813039 16.586 rs5992907|rs396012|rs384215
rs396012 22 16806587 3 16796453 16813039 16.586 rs5992907|rs400509|rs384215
rs7293187 22 16807274 0 16807274 16807274 0 NONE
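For downstream processing, the plink.tags.list layout shown above can be read with a few lines of Python. The sketch below is one possible reader, assuming whitespace-delimited columns, a header row, and a pipe-separated TAGS field with NONE for an empty list, as in the example. The function name is invented, and three rows of the example are embedded as a string so the snippet runs on its own.

```python
def parse_tags_list(text):
    """Parse the contents of a plink.tags.list file into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        for key in ("BP", "NTAG", "LEFT", "RIGHT"):
            row[key] = int(row[key])
        row["KBSPAN"] = float(row["KBSPAN"])
        # CHR is kept as a string; TAGS becomes a (possibly empty) list.
        row["TAGS"] = [] if row["TAGS"] == "NONE" else row["TAGS"].split("|")
        rows.append(row)
    return rows


# Three rows copied from the example above.
sample = """\
SNP CHR BP NTAG LEFT RIGHT KBSPAN TAGS
rs2542334 22 16694612 2 16693517 16695440 1.923 rs415170|rs2587108
rs873387 22 16713566 0 16713566 16713566 0 NONE
rs5992907 22 16796453 5 16796453 16830384 33.931 rs400509|rs396012|rs415651|rs384215|rs453557
"""

for row in parse_tags_list(sample):
    # NTAG should equal the number of SNP IDs listed in TAGS.
    assert row["NTAG"] == len(row["TAGS"])
    print(row["SNP"], row["NTAG"], row["KBSPAN"])
```

To read a real file, pass the result of open("plink.tags.list").read() to the function instead of the embedded sample.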
The settings for declaring that a SNP tags another SNP can be varied with the
commands
--tag-r2 0.5
to specify a minimum r-squared (based on the genotypic correlation,
see above); in this case it is set to a value of 0.5 as being
necessary to declare that one SNP tags another (the default is 0.8). Also,
--tag-kb 1000
will constrain the search for tags to be within a megabase (the default
is 250kb).
HINT If you specify the filename for
the --show-tags command to be the keyword all, then
PLINK will only generate the plink.tags.list file, but for
all SNPs in the dataset. (This means that you cannot have a file
actually called all used as the input for
the --show-tags command of course).
NOTE You can add the --tag-mode2 command to
specify an alternative input and output format. In this case, we
assume the input file contains two columns, with the second field being
either 0 or 1 to indicate whether or not this is a target SNP:
rs00001 0
rs00002 0
rs00003 1
rs00004 0
rs00005 1
rs00006 0
The output is in a similar form, except that tagging SNPs will now have a 1 in the second field:
rs00001 0
rs00002 0
rs00003 1
rs00004 1
rs00005 1
rs00006 1
i.e. the above example would be equivalent to the original input file
rs00003
rs00005
and output file
rs00003
rs00004
rs00005
rs00006
indicating that SNPs rs00004 and rs00006 have been added as tags.
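The relationship between the two formats can be sketched in Python: reading the SNPs flagged with 1 from a --tag-mode2 style listing gives the equivalent one-column list, and comparing input with output shows which SNPs were added as tags. The function and variable names are made up for this illustration, and the two listings are the ones from the example above.

```python
def flagged_snps(mode2_text):
    """Return the SNP IDs whose second field is 1 in a two-column listing."""
    snps = []
    for line in mode2_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "1":
            snps.append(fields[0])
    return snps


mode2_input = """\
rs00001 0
rs00002 0
rs00003 1
rs00004 0
rs00005 1
rs00006 0
"""

mode2_output = """\
rs00001 0
rs00002 0
rs00003 1
rs00004 1
rs00005 1
rs00006 1
"""

targets = flagged_snps(mode2_input)    # SNPs to be tagged
tagged = flagged_snps(mode2_output)    # targets plus their tags
added = [snp for snp in tagged if snp not in targets]
print("targets:", targets)
print("added as tags:", added)
```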
NOTE This function does not pick the minimal set of
SNPs required to tag all common variation in a region, in the way
tagging algorithms (e.g. Tagger) typically work. Rather,
this utility function is designed merely to indicate which other SNPs
tag one or more of a pre-specified list of SNPs.
Haplotype block estimation
The command
plink --bfile mydata --blocks
generates two files
plink.blocks
and
plink.blocks.det
Haplotype blocks are estimated following the default procedure in Haploview. Note
that only individuals with a non-missing phenotype are included in
this analysis.
By default, pairwise LD is only calculated for SNPs within 200kb. If
needed, this parameter can be changed via the --ld-window-kb
option.
The first file lists each block (2 or more SNPs) on a row, starting
with an asterisk symbol (*), for example:
* rs7527871 rs2840528 rs7545940
* rs2296442 rs2246732
* rs10752728 rs897635
* rs10489588 rs9661525 rs2993510
This format can be used with the --hap command to test each
haplotype in each block for association, or to estimate
the haplotype frequencies; for example,
plink --bfile mydata --hap plink.blocks --hap-freq
The second file, plink.blocks.det, is similar to the first but
contains some additional information:
CHR Chromosome identifier
BP1 The start position (base-pair units) of this block
BP2 The end position (base-pair units) of this block
KB The kilobase distance spanned by this block
NSNPS The number of SNPs in this block
SNPS List of SNPs in this block
for example
CHR BP1 BP2 KB NSNPS SNPS
1 2313888 2331789 17.902 3 rs7527871|rs2840528|rs7545940
1 2462779 2482556 19.778 2 rs2296442|rs2246732
1 2867411 2869431 2.021 2 rs10752728|rs897635
1 2974991 2979823 4.833 3 rs10489588|rs9661525|rs2993510
....
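The plink.blocks.det layout can be read in much the same way as any whitespace-delimited table. The Python sketch below parses the four example rows above (embedded as a string so that it runs on its own) and checks two things: that NSNPS matches the number of IDs in SNPS, and that KB equals (BP2 - BP1 + 1) / 1000. The second relationship is only an observation that holds for these example rows, not a documented definition, and the function name is invented for this illustration.

```python
def parse_blocks_det(text):
    """Parse the contents of a plink.blocks.det file into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    blocks = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        row["BP1"] = int(row["BP1"])
        row["BP2"] = int(row["BP2"])
        row["KB"] = float(row["KB"])
        row["NSNPS"] = int(row["NSNPS"])
        row["SNPS"] = row["SNPS"].split("|")
        blocks.append(row)
    return blocks


# The four example rows shown above.
sample_det = """\
CHR BP1 BP2 KB NSNPS SNPS
1 2313888 2331789 17.902 3 rs7527871|rs2840528|rs7545940
1 2462779 2482556 19.778 2 rs2296442|rs2246732
1 2867411 2869431 2.021 2 rs10752728|rs897635
1 2974991 2979823 4.833 3 rs10489588|rs9661525|rs2993510
"""

for block in parse_blocks_det(sample_det):
    assert block["NSNPS"] == len(block["SNPS"])
    # Observed in the example rows: KB is the inclusive span in kilobases.
    assert abs(block["KB"] - (block["BP2"] - block["BP1"] + 1) / 1000) < 1e-9
    print(block["CHR"], block["BP1"], block["BP2"], ",".join(block["SNPS"]))
```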