PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset


Multimarker haplotype tests

All tests described above are based on single SNP tests. It is also possible to impute haplotypes based on multimarker predictors using the standard E-M algorithm and to perform simple tests based on the distribution of probabilistically-inferred set of haplotypes for each individual.

As well as the autosomes, X and haploid chromosomes should be appropriately handled. Phasing can be based either on a sample of unrelated individuals or on certain kinds of family data. First, all founders are phased using the E-M algorithm; then all descendants of these founders are phased given the set of possible parental phases and assuming random mating. Currently it is not possible to phase sibships without parents. The current implementation of the phasing and haplotype testing algorithm is designed to focus on relatively small regions of the genome, rather than to phase whole chromosomes at once.

HINT! Another approach to haplotype-testing can be found under the page describing proxy association. This set of methods essentially just provides a different interface to the exact same E-M phasing and haplotype-testing algorithms, one that is centered around a specific reference SNP.

Specification of haplotypes to be estimated

Haplotype testing in PLINK requires that the user supplies a file listing the haplotypes to be tested. (Some precomputed lists are given below which might be useful in some circumstances.) The formats of these files are described below. An alternative is to specify a simple sliding window of fixed haplotype size (also described below).

The command
plink --file mydata --hap myfile.hlist

will read the file myfile.hlist, each row of which is expected to have one of the three following formats:

1) Particular allele specified

The first format specifies a particular haplotype at a given locus. Two example rows of this format are:
     rs1001 5 0 201  1 2   TC    snp1 snp2
     rs1002 5 0 202  A C   TTA   snp1 snp3 snp4
     ...
The columns represent:
 
     Col 1  : Imputed SNP name
     Col 2  : Imputed SNP chromosome
     Col 3  : Imputed SNP genetic distance (default: Morgan coding)
     Col 4  : Imputed SNP physical position (bp units)
     Col 5  : Imputed SNP allele 1 name
     Col 6  : Imputed SNP allele 2 name
     Col 7  : Tag SNP allele/haplotype that equals imputed SNP allele 1
     Col 8+ : Tag SNP(s) [in same order as haplotype in Col 7]
Here we have explicitly specified the TC and TTA haplotypes. For example, in the first case, SNPs snp1 and snp2 may have all four common haplotypes seen in the sample, TT, CT and CC as well as TC; this command would select only the TC haplotype to be imputed, or as the focus of haplotype analysis. The imputed SNP, rs1001 therefore has the following alleles:
     TC/TC    1/1
     TC/*     1/2
     */*      2/2
and will be positioned on chromosome 5, at base-position 201. Haplotypes other than TC will be coded 2.
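
The allele coding above can be sketched in a few lines of Python. This only illustrates the mapping from a phased haplotype pair to the imputed SNP genotype; it is not PLINK's implementation, and the function name imputed_genotype and its arguments are hypothetical.

```python
def imputed_genotype(hap1, hap2, target="TC"):
    """Code a phased haplotype pair as an imputed-SNP genotype.

    The target haplotype is allele 1; every other haplotype is allele 2.
    (Hypothetical helper, for illustration only.)
    """
    alleles = sorted("1" if hap == target else "2" for hap in (hap1, hap2))
    return "/".join(alleles)


print(imputed_genotype("TC", "TC"))  # two copies of the target haplotype
print(imputed_genotype("CT", "TC"))  # one copy
print(imputed_genotype("TT", "CC"))  # no copies
```

Sorting the two alleles means the result does not depend on which haplotype is listed first.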

The imputed SNP details (alleles, etc) will only be used if the --hap-impute option has been requested. For --hap-assoc and --hap-tdt options (which consider all possible phases rather than just imputing the most likely) these are not considered (but they are still required in this input file).

2) 'Wildcard' specification

Alternatively, all haplotypes at a given locus above the --maf threshold can be automatically estimated by entering a line in myfile.hlist as follows:
* snp1 snp2 snp3
* snp1 snp2
i.e. where the first character is an asterisk *, which would, taking just the first line for example, create all 3-SNP haplotypes for the SNPs labelled in the MAP file as snp1, snp2 and snp3, above the minor allele frequency threshold. If the haplotypes were, for example, AAC, AGG and TGG, then the following names would be automatically assigned:
     H1_AAC_
     H1_AGG_
     H1_TGG_
Haplotypes based on subsequent lines in the file would be labelled H2_*_, H3_*_, etc. In this case, all two-SNP haplotypes for snp1 and snp2 would start H2_. The chromosome and position flags for the new haplotypes are set to equal the first SNP of the set.
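
As a small illustration of the naming scheme, the following Python sketch builds the labels for one wildcard row. It assumes the number after H is simply the position of the row in the haplotype list file; the function name wildcard_names is hypothetical.

```python
def wildcard_names(row_number, haplotypes):
    """Build labels of the form H<row>_<haplotype>_ for one wildcard row.

    row_number is assumed to be the 1-based position of the row in the
    haplotype list file (an assumption made for this illustration).
    """
    return [f"H{row_number}_{hap}_" for hap in haplotypes]


print(wildcard_names(1, ["AAC", "AGG", "TGG"]))
```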

3) 'Named wildcard' specification

Finally, this format is identical to the previous wildcard specification, except a name can be given to the haplotype. This uses ** instead of * to start a row; the second entry is then interpreted as the name of the haplotype locus rather than the first SNP. For example:
** BLOCK1 snp1 snp2 snp3
** BLOCK2 snp6 snp7
The only difference is that BLOCK1 and BLOCK2 names will be used in the output instead of H1 and H2 being assigned automatically.
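
The three row formats can be told apart by their first field. The Python sketch below classifies a row and splits it into its parts; it is a reading aid for the file format, not PLINK's own parser, and the function name and dictionary keys are hypothetical.

```python
def parse_hlist_row(line):
    """Classify one row of a haplotype list file and split it into fields."""
    fields = line.split()
    if fields[0] == "**":
        # Named wildcard: the second entry is the locus name.
        return {"kind": "named_wildcard", "name": fields[1], "snps": fields[2:]}
    if fields[0] == "*":
        # Plain wildcard: every remaining entry is a SNP name.
        return {"kind": "wildcard", "snps": fields[1:]}
    # Otherwise: seven descriptive columns followed by the tag SNPs.
    name, chrom, cm, bp, a1, a2, hap = fields[:7]
    snps = fields[7:]
    if len(hap) != len(snps):
        raise ValueError("haplotype length does not match number of tag SNPs")
    return {"kind": "explicit", "name": name, "chr": chrom, "cm": float(cm),
            "bp": int(bp), "a1": a1, "a2": a2, "haplotype": hap, "snps": snps}


print(parse_hlist_row("rs1001 5 0 201  1 2   TC    snp1 snp2"))
print(parse_hlist_row("* snp1 snp2 snp3"))
print(parse_hlist_row("** BLOCK1 snp1 snp2 snp3"))
```

The length check reflects the requirement that the tag SNPs appear in the same order as the alleles of the haplotype in column 7.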

4) Sliding window specification

Instead of specifying a haplotype file with the --hap option, you can use the --hap-window option to specify all haplotypes in sliding windows of a fixed number of SNPs (shifting 1 SNP at a time). For example, use
plink --bfile mydata --hap-window 3 --hap-assoc

to form all 3-SNP haplotypes across the entire dataset (respecting chromosome boundaries, however). In this case the windows will be automatically named WIN1, WIN2, etc. This command can take a comma-delimited list of values, e.g.
     --hap-window 1,2,3
to perform all single SNP tests (1-SNP haplotypes) as well as sliding windows of all 2-SNP and 3-SNP haplotypes.
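
The way windows are formed can be sketched as follows. This Python example slides a fixed-size window one SNP at a time and drops any window that would span two chromosomes. It is an illustration only: the function name is hypothetical, and the numbering of WIN labels (in order of appearance, for a single window size) is an assumption.

```python
def sliding_windows(snps, size):
    """Enumerate fixed-size SNP windows, shifting one SNP at a time.

    snps is a list of (chromosome, snp_id) pairs in map order. Windows
    that would cross a chromosome boundary are skipped.
    """
    windows = []
    for start in range(len(snps) - size + 1):
        chunk = snps[start:start + size]
        if len({chrom for chrom, _ in chunk}) == 1:
            name = f"WIN{len(windows) + 1}"  # assumed numbering
            windows.append((name, [snp_id for _, snp_id in chunk]))
    return windows


snps = [("1", "snp1"), ("1", "snp2"), ("1", "snp3"), ("2", "snp4"), ("2", "snp5")]
for name, members in sliding_windows(snps, 2):
    print(name, members)
```

With five SNPs on two chromosomes, the window that would pair snp3 with snp4 is not formed.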

Precomputed lists of multimarker tests

Below are links to some PLINK-formatted lists of multimarker tests selected for Affymetrix 500K and Illumina whole genome products, based on consideration of the CEU Phase 2 HapMap (at r-squared=0.8 threshold). One should download the appropriate file and run with the --hap option (after ensuring that any strand issues have been resolved). These files were generated by Itsik Pe'er and others, as described in this manuscript:
     Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D 
     & Daly MJ (2006) Evaluating and improving power in whole-genome 
     association studies using fixed marker sets. Nat Genet, 38(6): 605-6.
These tables list all tags for every common HapMap SNP, at the given r-squared threshold. The same haplotype may therefore appear multiple times (i.e. if it tags more than 1 SNP). The haplotypes are specified in terms of the + (positive) strand relative to the HapMap. You might need to reformat your data (using the --flip command, for instance) before you can use these files.

Note These tables obviously assume that all tags are present in the final, post-quality-control dataset: i.e. if certain SNPs have been removed, it will be better to reselect the predictors -- that is, these lists should really only be used as a first pass, for convenience.

Estimating haplotype frequencies

To obtain the haplotype frequencies for all haplotypes in each window, use the option:
plink --file mydata --hap myfile.hlist --hap-freq

which will generate the file
     plink.freq.hap
which contains the fields (no header)
     LOCUS        Haplotype locus / window name
     HAPLOTYPE    Haplotype identifier
     F            Frequency in sample (founders)

Testing for haplotype-based case/control and quantitative trait association

In a population-based sample of unrelated individuals, case/control and quantitative traits can be analysed for haplotype associations, using the option, for example,
plink --file mydata --hap myfile.hlist --hap-assoc

which will generate haplotype-specific tests (1df) for both disease and quantitative traits; for disease traits only, an omnibus association statistic will also be computed. This option generates the file
     plink.assoc.hap
which contains the following fields:
     LOCUS        Haplotype locus / window name
     HAPLOTYPE    Haplotype identifier / "OMNIBUS"
     F_A          Frequency in cases
     F_U          Frequency in controls
     CHISQ        Test for association
     DF           Degrees of freedom
     P            Asymptotic p-value
     SNPS         SNPs forming the haplotype
or
     plink.qassoc.hap
which contains the following fields:
     LOCUS        Haplotype locus / window name
     HAPLOTYPE    Haplotype identifier
     NANAL        Number of individuals in analysis
     BETA         Regression coefficient
     RSQ          Proportion variance explained
     STAT         Test statistic (T)
     P            Asymptotic p-value
     SNPS         SNPs forming the haplotype
In all cases, the tests are based on the expected number of haplotypes each individual has (which might be fractional). The case/control omnibus test is an (H-1) degree of freedom test, if there are H haplotypes.
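
To make the idea of a fractional haplotype count concrete, the Python sketch below computes the expected number of copies of one haplotype for an individual, given that individual's possible phases and their posterior probabilities. The function name and the example probabilities are hypothetical; this is not PLINK's code.

```python
def expected_count(phases, haplotype):
    """Expected number of copies (0 to 2) of a haplotype for one individual.

    phases is a list of (hap1, hap2, posterior) tuples whose posteriors
    sum to 1 for that individual.
    """
    return sum(
        posterior * ((hap1 == haplotype) + (hap2 == haplotype))
        for hap1, hap2, posterior in phases
    )


# Hypothetical individual with two possible phases.
ambiguous = [("TC", "CT", 0.7), ("TT", "CC", 0.3)]
print(expected_count(ambiguous, "TC"))

# Hypothetical individual phased with certainty as a TC homozygote.
certain = [("TC", "TC", 1.0)]
print(expected_count(certain, "TC"))
```

Each phase contributes its copy number weighted by its posterior probability, which is why an individual with ambiguous phase can carry a non-integer count.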

Haplotype-based association tests with GLMs

The following options use linear and logistic regression to perform haplotype-based association analysis. The two main commands, --hap-linear and --hap-logistic, are analogous to --linear and --logistic, described in the Association section.

The main advantages of these commands over the above approaches are that they can include one or more covariates and allow for permutation. The disadvantage is that they will run a little more slowly.

The basic command is
plink --file mydata --hap myfile.hlist --hap-logistic

(alternatively, for a quantitative outcome, use --hap-linear; aside from minor differences in the output, the discussion below applies equally to both forms of these commands).

NOTE Here the haplotypes to be tested are specified in a file with the --hap command, but one could alternatively use a sliding window analysis, e.g. --hap-window 2,3,4 to cover all 2, 3 and 4-SNP windows.

The output is in the file
     plink.assoc.hap.logistic
(or plink.assoc.hap.linear) which has the fields:
        NSNP    Number of SNPs in this haplotype
        NHAP    Number of common haplotypes (threshold determined by --mhf, 0.01 default)
         CHR    Chromosome code
         BP1    Physical position of left-most (5') SNP (base-pair)
         BP2    Physical position of right-most (3') SNP (base-pair)
        SNP1    SNP ID of left-most (5') SNP
        SNP2    SNP ID of right-most (3') SNP
   HAPLOTYPE    Haplotype 
           F    Frequency in sample
          OR    Estimated odds ratio
        STAT    Test statistic (T from Wald test)
           P    Asymptotic p-value
for example: (spaces between rows added for clarity)
  NSNP NHAP  CHR       BP1       BP2       SNP1       SNP2  HAPLOTYPE       F      OR    STAT       P
     2    2   22  15462210  15462259 rs11089263 rs11089264         AA   0.345    1.31   0.693   0.405
     2    2   22  15462210  15462259 rs11089263 rs11089264         CG   0.655   0.762   0.693   0.405

     3    3   22  15688352  15690057   rs165650   rs165757        GTG   0.117   0.544    1.46   0.227
     3    3   22  15688352  15690057   rs165650   rs165757        CTG  0.0167   0.406   0.525   0.469
     3    3   22  15688352  15690057   rs165650   rs165757        CGA   0.867     1.7    1.56   0.212

     5    5   22  15691787  15699058   rs175152   rs165914      ACACT   0.129   0.515    2.13   0.144
     5    5   22  15691787  15699058   rs175152   rs165914      CCACT   0.236   0.917  0.0566   0.812
     5    5   22  15691787  15699058   rs175152   rs165914      CCACG  0.0169    1.74   0.198   0.656
     5    5   22  15691787  15699058   rs175152   rs165914      CTGTG   0.085   0.565    1.11   0.292
     5    5   22  15691787  15699058   rs175152   rs165914      CTATG   0.533    1.88    3.36  0.0666

     5    4   22  15902049  15939567  rs2845389  rs4819958      GTAAA  0.0857   0.719   0.388   0.533
     5    4   22  15902049  15939567  rs2845389  rs4819958      GTGAA    0.32    1.04  0.0185   0.892
     5    4   22  15902049  15939567  rs2845389  rs4819958      CCGGG   0.303   0.548    2.97  0.0847
     5    4   22  15902049  15939567  rs2845389  rs4819958      GCGGG   0.292    1.82    3.28  0.0701
which illustrates results for the first four haplotype window positions (e.g. the second window position contains 3 SNPs, and there are 3 common haplotypes, GTG, CTG and CGA).

The additional command
     --hap-omnibus
instructs PLINK to perform, instead of the haplotype-specific tests (each haplotype versus all others), a single H-1 df omnibus test for H haplotypes (jointly estimating and testing all haplotype effects at that position). This will result in a single row per window, with the following slightly different format. Now the first four window positions have only a single line of output, and a single p-value (the degrees of freedom will be NHAP-1). Also, there is no haplotype-specific output (e.g. haplotype names, frequencies or odds ratios):
   NSNP NHAP  CHR          BP1          BP2       SNP1       SNP2     STAT        P
      2    2   22     15462210     15462259 rs11089263 rs11089264    0.693    0.405
      3    3   22     15688352     15690057   rs165650   rs165757     1.57    0.457
      5    5   22     15691787     15699058   rs175152   rs165914     5.08    0.279
      5    4   22     15902049     15939567  rs2845389  rs4819958      4.4    0.222
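
The relation between STAT, NHAP and P in the omnibus output can be checked by hand. The Python sketch below treats STAT as a chi-square statistic with NHAP-1 degrees of freedom (an assumption based on the description above) and computes the upper-tail probability using only the standard library. The function name chi2_sf is hypothetical.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    y = stat / 2.0
    # Closed forms for df = 1 (a = 0.5) and df = 2 (a = 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while 2 * a < df:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


# (NHAP, STAT) pairs copied from the four omnibus rows above.
for nhap, stat in [(2, 0.693), (3, 1.57), (5, 5.08), (4, 4.4)]:
    print(nhap, stat, round(chi2_sf(stat, nhap - 1), 3))
```

Because STAT is printed to three significant figures, the recomputed values should only be expected to agree with the P column to within rounding.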
As mentioned above, covariates can be incorporated with the
     --covar myfile.txt
command. Note that the coefficients and p-values for the covariates are not listed in these output files (unlike the default for the --logistic command).

Permutation procedures can be used, with the command
     --mperm 10000
to specify, for example, ten thousand permutations. The empirical p-values from this analysis are listed in the file
     plink.assoc.hap.logistic.mperm
Note that there will be no SNP name listed in the permutation output file: rather, it will be in the form:
      TEST         EMP1         EMP2
        T0       0.4158            1
        T1       0.1782            1
        T2       0.2475            1
        T3       0.1683            1
      ...
The number of rows, and the order of the output, will be the same as for the asymptotic results file, so they can be easily aligned. e.g. here T0 would correspond to either the first omnibus test, or the first haplotype-specific test, T1 the second, etc.
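
Since the two files share row order, they can be joined by position. The Python sketch below does this for small in-memory tables and checks that the TEST labels run T0, T1, and so on. The function names are hypothetical, and pairing the example rows from the two listings above is purely for illustration; they are not claimed to come from the same run.

```python
def read_table(text):
    """Parse whitespace-delimited text with a header row into dicts."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in body]


def attach_empirical(asymptotic, permuted):
    """Join asymptotic and permutation results row by row."""
    if len(asymptotic) != len(permuted):
        raise ValueError("row counts differ; files are not from the same run")
    merged = []
    for index, (asym, perm) in enumerate(zip(asymptotic, permuted)):
        if perm["TEST"] != f"T{index}":
            raise ValueError(f"unexpected label {perm['TEST']} in row {index}")
        merged.append({**asym, "EMP1": perm["EMP1"], "EMP2": perm["EMP2"]})
    return merged


omnibus = read_table("""
NSNP NHAP CHR BP1 BP2 SNP1 SNP2 STAT P
2 2 22 15462210 15462259 rs11089263 rs11089264 0.693 0.405
3 3 22 15688352 15690057 rs165650 rs165757 1.57 0.457
""")
mperm = read_table("""
TEST EMP1 EMP2
T0 0.4158 1
T1 0.1782 1
""")
for row in attach_empirical(omnibus, mperm):
    print(row["SNP1"], row["SNP2"], row["P"], row["EMP1"])
```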

Haplotype-based TDT association test

If family data (e.g. parent-offspring trios) are being analysed, use the option
plink --file mydata --hap myfile.hlist --hap-tdt

to test for TDT haplotype-specific association. This option generates the file
     plink.tdt.hap
which contains the following fields:
     LOCUS        Haplotype locus / window name
     HAPLOTYPE    Haplotype identifier / "OMNIBUS"
     T            Number of transmitted haplotypes
     U            Number of untransmitted haplotypes
     CHISQ        Test for association
     P            Asymptotic p-value

Imputing multimarker haplotypes

The --hap-impute option creates two new files. The command
plink --file mydata --hap myfile.hlist --hap-impute

will generate the files:
     plink.impute.ped
     plink.impute.map
based on the most likely E-M phase reconstructed haplotypes. One could then simply treat the most likely haplotype assignments as SNPs and use all the standard analytic options of PLINK, e.g. --assoc.

Warning This represents a quick and dirty approach to haplotype testing. Depending on how accurately the haplotypes have been imputed (i.e. the range of maximum posterior probabilities per individual) some bias will be introduced into subsequent tests based on these 'SNPs'. Typically, as long as cases and controls are phased together, as they are here, this bias is likely to be quite small and so should not substantively impact results (unpublished simulation results, SMP). Furthermore, exact methods can be used to refine the association for the putative hits discovered by this approach.

NOTE Future versions will allow for a binary PED file to be created from the --hap-impute command. You do not need to specify --recode when using --hap-impute.

Tabulating individuals' haplotype phases

To obtain a summary of all possible haplotype phases and the corresponding posterior probabilities (i.e. given genotype data), use the command:
plink --file mydata --hap myfile.hlist --hap-phase

which will generate the file
     plink.phase-*
where * is the name of the 'window' (i.e. the row of the haplotype list file). That is, if the haplotype list contains multiple rows, then multiple phase files will be generated. These files contain the fields, where each row is one possible haplotype phase for one individual:
     FID       Family ID
     IID       Individual ID
     PH        Phase number for that individual (0-based)
     HAP1      First haplotype, H1
     HAP2      Second haplotype, H2
     POSTPROB  P(H1,H2 | G ) 
     BEST      1 if most likely phase for that individual
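
As an illustration of how these fields relate, the Python sketch below picks, for each individual, the phase with the highest posterior probability, which is the row the BEST column is described as flagging. The rows are hypothetical and the function name best_phases is not part of PLINK.

```python
def best_phases(rows):
    """Return the most probable phase row for each (FID, IID) pair.

    rows is an iterable of dicts keyed by the column names listed above.
    """
    best = {}
    for row in rows:
        key = (row["FID"], row["IID"])
        current = best.get(key)
        if current is None or float(row["POSTPROB"]) > float(current["POSTPROB"]):
            best[key] = row
    return best


# Hypothetical rows, one per possible phase per individual.
rows = [
    {"FID": "F1", "IID": "I1", "PH": "0", "HAP1": "TC", "HAP2": "CT", "POSTPROB": "0.7"},
    {"FID": "F1", "IID": "I1", "PH": "1", "HAP1": "TT", "HAP2": "CC", "POSTPROB": "0.3"},
    {"FID": "F2", "IID": "I1", "PH": "0", "HAP1": "TC", "HAP2": "TC", "POSTPROB": "1"},
]
for (fid, iid), row in best_phases(rows).items():
    print(fid, iid, row["HAP1"], row["HAP2"], row["POSTPROB"])
```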
 

This document last modified Wednesday, 25-Jan-2017 11:39:27 EST