PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

LD calculations

PLINK includes a set of options to calculate pairwise linkage disequilibrium between SNPs, and to present or process this information in various ways. Also see the functions on haplotype analysis.

Pairwise LD measures for a single pair of SNPs

The command --ld followed by two SNP identifiers prints the following LD statistics to the LOG file for a single pair of SNPs: r-squared, D', the estimated haplotype frequencies and those expected under linkage equilibrium. It also indicates which haplotypes are in phase (i.e. occurring more often than expected by chance). For example:
plink --bfile mydata --ld rs2840528 rs7545940

gives the following output
     LD information for SNP pair [ rs2840528 rs7545940 ]

        R-sq = 0.592     D' = 0.936

        Haplotype     Frequency    Expectation under LE
        ---------     ---------    --------------------
            GC          0.013            0.199
            AC          0.435            0.245
            GT          0.441            0.250
            AT          0.111            0.307

        In phase alleles are GT/AC
The LD statistics presented here are based on haplotype frequencies estimated via the EM algorithm. Only founders are used in these calculations.
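
As a rough illustration of how the printed statistics relate to one another, the following Python sketch recomputes r-squared, D', the equilibrium expectations and the in-phase haplotypes from four haplotype frequencies. It uses the standard textbook definitions and takes the frequencies as given: it does not perform the EM estimation, and the function name ld_stats is hypothetical, not part of PLINK. Because the input below is the rounded frequencies from the output above, the results should only be expected to agree with PLINK's to about two decimal places.

```python
def ld_stats(hap_freqs):
    """Return (r2, d_prime, expected, in_phase) for two biallelic SNPs.

    hap_freqs maps two-character haplotypes (e.g. "GC") to frequencies.
    Both SNPs are assumed to be polymorphic (two alleles each).
    """
    total = sum(hap_freqs.values())
    freqs = {h: f / total for h, f in hap_freqs.items()}  # absorb rounding error

    a1, a2 = sorted({h[0] for h in freqs})  # alleles at the first SNP
    b1, b2 = sorted({h[1] for h in freqs})  # alleles at the second SNP
    p_a = {a: sum(f for h, f in freqs.items() if h[0] == a) for a in (a1, a2)}
    p_b = {b: sum(f for h, f in freqs.items() if h[1] == b) for b in (b1, b2)}

    # Expectation under linkage equilibrium: product of allele frequencies
    expected = {a + b: p_a[a] * p_b[b] for a in (a1, a2) for b in (b1, b2)}
    d = freqs.get(a1 + b1, 0.0) - expected[a1 + b1]

    r2 = d * d / (p_a[a1] * p_a[a2] * p_b[b1] * p_b[b2])

    # D' rescales |D| by its maximum possible value given the allele frequencies
    if d >= 0:
        d_max = min(p_a[a1] * p_b[b2], p_a[a2] * p_b[b1])
    else:
        d_max = min(p_a[a1] * p_b[b1], p_a[a2] * p_b[b2])
    d_prime = abs(d) / d_max

    # In-phase haplotypes are those observed more often than expected
    in_phase = sorted(h for h in expected if freqs.get(h, 0.0) > expected[h])
    return r2, d_prime, expected, in_phase


r2, d_prime, expected, in_phase = ld_stats(
    {"GC": 0.013, "AC": 0.435, "GT": 0.441, "AT": 0.111}
)
print(round(r2, 3), round(d_prime, 3), in_phase)
```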

Pairwise LD measures for multiple SNPs (genome-wide)

Correlations based on genotype allele counts (i.e. without phasing, and for founders only) can be obtained with the commands
plink --file mydata --r

or

plink --file mydata --r2

That is, for each pair of SNPs this calculates the correlation between two variables, each coded 0, 1 or 2 to represent the number of non-reference alleles at that SNP. The squared correlation based on genotypic allele counts is therefore not identical to the r-sq as estimated from haplotype frequencies (see above), although it will typically be very similar. Because it is faster to calculate, it provides a good way to screen for strong LD. The estimated value for the example in the section above (rs2840528,rs7545940) is 0.5748 (versus 0.592).
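
A minimal Python sketch of this genotype-count correlation is given below. It is an ordinary Pearson correlation on 0/1/2 codes; the function name genotype_r is hypothetical, and dropping individuals with a missing genotype at either SNP (None here) is an assumption of this sketch rather than a statement of how PLINK handles missing data.

```python
from math import sqrt


def genotype_r(x, y):
    """Pearson correlation between two SNPs coded 0, 1 or 2 (None = missing)."""
    # Keep only individuals genotyped at both SNPs (assumption of this sketch)
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Allele counts for six hypothetical individuals at two SNPs
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [0, 1, 2, 2, 0, 1]
r = genotype_r(snp1, snp2)
print(r, r ** 2)  # the analogues of --r and --r2 for this pair
```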

Both commands create a file called
	plink.ld
with a list of R or R-squared values in it.
Filtering the output
By default, several filters are imposed on which pairwise comparisons are calculated and reported. To only analyse SNPs that are not more than 10 SNPs apart, for example, use the option (default is 10 SNPs)
     --ld-window 10
to specify a kb window in addition (default 1Mb)
     --ld-window-kb 1000
and to report only values above a particular value (this only applies when the --r2 and not the --r command is used) (default is 0.2)
     --ld-window-r2 0.2
The default for --ld-window-r2 is set at 0.2 to reduce the size of output files when many comparisons are made: to get all pairs reported, set --ld-window-r2 to 0.
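
If all pairs have been reported with --ld-window-r2 0, a threshold can still be applied afterwards. The Python sketch below reads whitespace-delimited LD output with a header row and keeps pairs at or above a chosen value. The column names SNP_A, SNP_B and R2 are an assumption about the layout of plink.ld (check the header of your own file), the function name read_ld_pairs is hypothetical, and the sample lines are made up for illustration.

```python
def read_ld_pairs(lines, min_r2=0.0):
    """Yield (snp_a, snp_b, r2) for rows whose R2 is at least min_r2."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows, None)
    if header is None:
        return  # empty input
    # Locate columns by name so that extra columns do not matter
    i_a, i_b, i_r2 = (header.index(name) for name in ("SNP_A", "SNP_B", "R2"))
    for fields in rows:
        r2 = float(fields[i_r2])
        if r2 >= min_r2:
            yield fields[i_a], fields[i_b], r2


# Made-up sample in the assumed layout; real use would pass open("plink.ld")
sample = [
    " CHR_A  BP_A  SNP_A  CHR_B  BP_B  SNP_B    R2",
    "     1  1000   snp1      1  2000   snp2   0.9",
    "     1  1000   snp1      1  3000   snp3  0.05",
]
for snp_a, snp_b, r2 in read_ld_pairs(sample, min_r2=0.2):
    print(snp_a, snp_b, r2)
```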
Obtaining LD values for a specific SNP versus all others
To obtain all LD values for a set of SNPs versus one specific SNP, use the --ld-snp command in conjunction with --r2. For example, to get a list of all values for every SNP within 1Mb of rs12345, use the command
    plink --file mydata \
          --r2 \
          --ld-snp rs12345 \
          --ld-window-kb 1000 \
          --ld-window 99999 \
          --ld-window-r2 0

The --ld-window and --ld-window-r2 commands effectively mean that output will be shown for all other SNPs within 1Mb of rs12345.

Similar to the --ld-snp command, but for multiple seed SNPs: to obtain all LD values from a group of SNPs with other SNPs, use the command
     --ld-snp-list mysnps.txt
where mysnps.txt is a list of SNPs.
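
When the seed SNPs come from another script, the list file and the command line can be generated programmatically. The Python sketch below writes one SNP ID per line (an assumed layout for the list file) and assembles the same options used in the --ld-snp example above. The helper names are hypothetical, and the command is only built here, not run.

```python
def write_snp_list(path, snp_ids):
    """Write SNP IDs to a text file, one per line (assumed list format)."""
    with open(path, "w") as handle:
        for snp in snp_ids:
            handle.write(snp + "\n")


def ld_snp_list_args(fileset, snp_list_path, window_kb=1000):
    """Build the argument list for an --r2 run seeded by a SNP list."""
    return [
        "plink", "--file", fileset,
        "--r2",
        "--ld-snp-list", snp_list_path,
        "--ld-window-kb", str(window_kb),
        "--ld-window", "99999",   # effectively no limit on SNP count
        "--ld-window-r2", "0",    # report all pairs
    ]


args = ld_snp_list_args("mydata", "mysnps.txt")
print(" ".join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) once mysnps.txt has been written, provided a plink executable is on the PATH.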
Obtaining a matrix of LD values
Alternatively, it is possible to add the --matrix option, which creates a matrix of LD values rather than a list: in this case, all SNP pairs are calculated and reported, even for SNPs on different chromosomes.

Note: To force all SNP-by-SNP cross-chromosome comparisons with the standard output format (i.e. without --matrix) add the flag
     --inter-chr
instead. This can be combined with --ld-window-r2, for example to list all inter-chromosomal SNP pairs with very high R-squared values. Warning: this command could take an excessively long time to run if applied to large datasets with many SNPs.

Functions to select tag SNPs for specified SNP sets

The command
plink --bfile mydata --show-tags mysnps.txt

where mysnps.txt is just a list of SNP IDs, generates a file
     plink.tags
that lists all the SNPs in the dataset that tag the SNPs in mysnps.txt (including the SNPs in the original file). A message is also written to the LOG file that indicates how many new SNPs were added
     Reading SNPs to tag from [ mysnps.txt ]
     Read 10 SNPs to tag, of which 10 are unique and present
     In total, added 2 tag SNPs
     Writing tag list to [ plink.tags ]
meaning that plink.tags will contain 12 SNPs. This command could be useful, for example, if one wants to generate a list of SNPs that tag all known coding SNPs, or a list of known disease-associated SNPs.

If the option
     --list-all
is also added, then an additional file is generated that gives some more details for each target SNP (i.e. each SNP listed in mysnps.txt, in the above example) regarding how many and which tags were set for it. The file is named
     plink.tags.list
and has the following fields
       SNP   Target SNP ID
       CHR   Chromosome code
        BP   Physical position (base-pair)
      NTAG   Number of other SNPs that tag this SNP
      LEFT   Physical position of left-most (5') tagging SNP (bp)
     RIGHT   Physical position of right-most (3') tagging SNP (bp)
    KBSPAN   Kilobase size of region implied by LEFT-RIGHT
      TAGS   List of SNPs that tag target
For example:

            SNP  CHR         BP NTAG       LEFT      RIGHT   KBSPAN TAGS
      rs2542334   22   16694612    2   16693517   16695440    1.923 rs415170|rs2587108
      rs2587108   22   16695440    2   16693517   16695440    1.923 rs415170|rs2542334
       rs873387   22   16713566    0   16713566   16713566        0 NONE
        rs11917   22   16717565    2   16717565   16742194   24.629 rs1057721|rs2075444
      rs1057721   22   16718397    2   16717565   16742194   24.629 rs11917|rs2075444
      rs9605422   22   16737494    0   16737494   16737494        0 NONE
      rs2075444   22   16742194    2   16717565   16742194   24.629 rs11917|rs1057721
      rs4819644   22   16744470    0   16744470   16744470        0 NONE
      rs2083882   22   16769795    0   16769795   16769795        0 NONE
      rs5992907   22   16796453    5   16796453   16830384   33.931 rs400509|rs396012|rs415651|rs384215|rs453557
       rs400509   22   16800853    3   16796453   16813039   16.586 rs5992907|rs396012|rs384215
       rs396012   22   16806587    3   16796453   16813039   16.586 rs5992907|rs400509|rs384215
      rs7293187   22   16807274    0   16807274   16807274        0 NONE
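
For downstream processing, the fields above can be read into simple records. The Python sketch below is one possible reader, assuming the file is whitespace-delimited with the header row shown in the example; the function name parse_tags_list is hypothetical. In the example rows, KBSPAN equals (RIGHT - LEFT) / 1000 and NTAG equals the number of entries in TAGS.

```python
def parse_tags_list(lines):
    """Parse plink.tags.list-style lines into a list of dictionaries."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows, None)
    records = []
    for fields in rows:
        rec = dict(zip(header, fields))
        # TAGS is a |-separated list, or the keyword NONE when there are no tags
        tags = [] if rec["TAGS"] == "NONE" else rec["TAGS"].split("|")
        records.append({
            "snp": rec["SNP"],
            "chr": rec["CHR"],
            "bp": int(rec["BP"]),
            "ntag": int(rec["NTAG"]),
            "left": int(rec["LEFT"]),
            "right": int(rec["RIGHT"]),
            "kbspan": float(rec["KBSPAN"]),
            "tags": tags,
        })
    return records


example = [
    "      SNP  CHR         BP NTAG       LEFT      RIGHT   KBSPAN TAGS",
    "rs2542334   22   16694612    2   16693517   16695440    1.923 rs415170|rs2587108",
    " rs873387   22   16713566    0   16713566   16713566        0 NONE",
]
for rec in parse_tags_list(example):
    print(rec["snp"], rec["ntag"], rec["tags"])
```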

The settings for declaring that a SNP tags another SNP can be varied with the commands
     --tag-r2 0.5
to specify the minimum r-squared (based on the genotypic correlation, see above) required to declare that one SNP tags another; here it is set to 0.5 (the default is 0.8). Also,
     --tag-kb 1000
will constrain the search for tags to be within a megabase (the default is 250kb).

HINT If you specify the filename for the --show-tags command to be the keyword all, then PLINK will only generate the plink.tags.list file, but for all SNPs in the dataset. (This means that you cannot have a file actually called all used as the input for the --show-tags command of course).

NOTE You can add the --tag-mode2 command to specify an alternative input and output format. In this case, we assume the input file contains two columns, with the second field being either 0 or 1 to indicate whether or not this is a target SNP:
     rs00001  0
     rs00002  0
     rs00003  1
     rs00004  0
     rs00005  1
     rs00006  0
The output is in a similar form, except that tagging SNPs will now have a 1 in the second field:
     rs00001  0
     rs00002  0
     rs00003  1
     rs00004  1
     rs00005  1
     rs00006  1
i.e. the above example would be equivalent to the original input file
     rs00003  
     rs00005  
and output file
     rs00003  
     rs00004  
     rs00005  
     rs00006  
indicating that SNPs rs00004 and rs00006 have been added as tags.
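
The correspondence between the two formats can be sketched in a few lines of Python. The function name flagged_snps is hypothetical; it simply collects the SNPs whose second field is 1, which converts a --tag-mode2 listing into the plain list format, and comparing input and output shows which SNPs were added as tags.

```python
def flagged_snps(lines):
    """Return the SNP IDs whose second field is 1 in a two-column listing."""
    snps = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "1":
            snps.append(fields[0])
    return snps


mode2_input = ["rs00001  0", "rs00002  0", "rs00003  1",
               "rs00004  0", "rs00005  1", "rs00006  0"]
mode2_output = ["rs00001  0", "rs00002  0", "rs00003  1",
                "rs00004  1", "rs00005  1", "rs00006  1"]

targets = flagged_snps(mode2_input)
tagged = flagged_snps(mode2_output)
added = [snp for snp in tagged if snp not in targets]
print(targets, tagged, added)
```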

NOTE This function does not pick the minimal set of SNPs required to tag all common variation in a region, in the way tagging algorithms such as Tagger typically work. Rather, this utility function is designed merely to indicate which other SNPs tag one or more of a pre-specified list of SNPs.

Haplotype block estimation

The command
plink --bfile mydata --blocks

generates two files
     plink.blocks
and
     plink.blocks.det
Haplotype blocks are estimated following the default procedure in Haploview. Note that only individuals with a non-missing phenotype are included in this analysis.

By default, pairwise LD is only calculated for SNPs within 200kb. If needed, this parameter can be changed via the --ld-window-kb option.

The first file lists each block (2 or more SNPs) on a row, starting with an asterisk symbol (*), for example:
     * rs7527871 rs2840528 rs7545940
     * rs2296442 rs2246732
     * rs10752728 rs897635
     * rs10489588 rs9661525 rs2993510
This format can be used with the --hap command, for example to test each haplotype in each block for association, or to estimate the haplotype frequencies:
plink --bfile mydata --hap plink.blocks --hap-freq
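
To work with the block definitions outside PLINK, the asterisk-prefixed rows can be read into lists of SNP IDs. This is a minimal Python sketch assuming the layout shown above (an asterisk followed by whitespace-separated SNP IDs); the function name read_blocks is hypothetical.

```python
def read_blocks(lines):
    """Return one list of SNP IDs per block row starting with '*'."""
    blocks = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "*":
            blocks.append(fields[1:])
    return blocks


example = [
    "* rs7527871 rs2840528 rs7545940",
    "* rs2296442 rs2246732",
]
for block in read_blocks(example):
    print(len(block), block)
```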

The second file, plink.blocks.det, is similar to the first, but contains some additional information:
     CHR      Chromosome identifier
     BP1      The start position (base-pair units) of this block
     BP2      The end position (base-pair units) of this block
     KB       The kilobase distance spanned by this block
     NSNPS    The number of SNPs in this block
     SNPS     List of SNPs in this block
for example
     CHR          BP1          BP2           KB  NSNPS SNPS
       1      2313888      2331789       17.902      3 rs7527871|rs2840528|rs7545940
       1      2462779      2482556       19.778      2 rs2296442|rs2246732
       1      2867411      2869431        2.021      2 rs10752728|rs897635
       1      2974991      2979823        4.833      3 rs10489588|rs9661525|rs2993510
       ....
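
The detailed file can be read in the same spirit. The Python sketch below assumes a whitespace-delimited layout with the header row shown above and skips any line that does not have one value per column; the function name read_block_details is hypothetical. In the four example rows, KB equals (BP2 - BP1 + 1) / 1000, i.e. the span counts both end positions; this is an observation from those rows, not a documented guarantee.

```python
def read_block_details(lines):
    """Parse plink.blocks.det-style lines into a list of dictionaries."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows, None)
    blocks = []
    for fields in rows:
        if len(fields) != len(header):
            continue  # skip truncated or filler lines such as "...."
        rec = dict(zip(header, fields))
        blocks.append({
            "chr": rec["CHR"],
            "bp1": int(rec["BP1"]),
            "bp2": int(rec["BP2"]),
            "kb": float(rec["KB"]),
            "nsnps": int(rec["NSNPS"]),
            "snps": rec["SNPS"].split("|"),
        })
    return blocks


example = [
    "CHR          BP1          BP2           KB  NSNPS SNPS",
    "  1      2313888      2331789       17.902      3 rs7527871|rs2840528|rs7545940",
    "  1      2462779      2482556       19.778      2 rs2296442|rs2246732",
    "  ....",
]
for block in read_block_details(example):
    span_kb = (block["bp2"] - block["bp1"] + 1) / 1000
    print(block["snps"], block["kb"], span_kb)
```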
 
This document last modified Wednesday, 25-Jan-2017 11:39:27 EST