PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

FAQ and Hints

This section contains a small but expanding set of answers to questions and hints.


Can I convert my binary PED fileset back into a standard PED/MAP fileset?

Yes. Use the --recode option, for example:
plink --bfile mydata --recode --out mynewdata

You might also want to use the variant forms --recode12 and --recodeAD, described in the data management section.

To speed up input of a large fileset

The binary file format loads much faster than the PED/MAP format. In addition, if you know that you have already excluded all the individuals you want (with the per-individual genotyping threshold option), then setting
     --mind 1 
will skip the step where per-individual genotyping rates are calculated, which can reduce the time taken to load the file. Note that the command --all is equivalent to specifying --mind 1 --geno 1 --maf 0 (i.e. do not apply any filters).

Why are no individuals included in the analysis?

A common cause is that all individuals are non-founders (e.g. a sibling pair dataset): by default, PLINK uses only founders to calculate allele frequencies. The
     --non-founders
option can force these individuals in.

Another possible cause is that none of the individuals has a valid sex code. In this case, they are all set to missing status, unless the
     --allow-no-sex
option is given. You are strongly recommended to enter the correct sex codes for all individuals, however, so that they can be appropriately treated in any subsequent analyses involving the sex chromosomes.

Why are my results different from an analysis using program X?

This is a difficult question to answer without specific details. If you send me a question along these lines and want to get an answer, please make it as specific as possible. Ideally, include example data that replicates the problem or illustrates the difference.

There is always the possibility that the difference is due to a bug in PLINK, which is obviously something I would want to track down and fix. Similarly, it could be due to a bug in the other software. Perhaps more likely, the difference arises from one of two general sources:
  • The analytic routines themselves are slightly different. Are the results dramatically different? Do not expect exact numerical agreement between similar analyses (e.g. even for a simple single SNP test, --assoc, --fisher and --logistic will give slightly different p-values, and this is to be expected). So, is the difference really meaningful? Perhaps more importantly, are you sure the other routine really is implementing a similar test, with similar assumptions, etc?
  • PLINK implements some default filtering of the data, i.e. it first removes individuals or SNPs with a below-threshold genotyping rate. This is a common reason for apparent differences between PLINK and other analysis packages. Look at the LOG file to check that exactly the same set of individuals was actually included in both analyses. In other words: be sure to check how missing data were handled in each case.
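
One way to check the second point is to compare the lists of individuals that each program actually analysed. The following is a minimal sketch, assuming you have already extracted the (family ID, individual ID) pairs from each analysis; the function name and the example IDs are illustrative and not part of PLINK.

```python
def compare_samples(ids_a, ids_b):
    """Return which (FID, IID) pairs are unique to each analysis and which are shared."""
    a, b = set(ids_a), set(ids_b)
    return {
        "only_a": sorted(a - b),   # analysed by the first program only
        "only_b": sorted(b - a),   # analysed by the second program only
        "shared": sorted(a & b),   # analysed by both
    }


if __name__ == "__main__":
    # Hypothetical ID lists: the first analysis dropped one individual.
    plink_ids = [("F1", "1"), ("F2", "1")]
    other_ids = [("F1", "1"), ("F2", "1"), ("F3", "1")]
    result = compare_samples(plink_ids, other_ids)
    print("Only in first analysis: ", result["only_a"])
    print("Only in second analysis:", result["only_b"])
```

If either of the "only" lists is non-empty, the two analyses were not run on the same individuals, and differences in the results are to be expected.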

How large a file can PLINK handle?

There are no fixed limits to the size of the data file. PLINK currently uses 1 byte for every 4 SNP genotypes, plus some overhead per SNP and per individual. This means that you should be able to load datasets of, say, 1 million SNPs and up to 5000 individuals on a machine with 2GB RAM without causing too much stress/swapping, etc. That is,
     5000 * 1e6 / 4 = 1.25e9 bytes = ~1.2GB.
Things scale more or less linearly after that. So for a very large file 4 times the size (20K individuals, for example), an 8GB or 16GB machine would be required to load the data in a single run.
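
The same back-of-the-envelope calculation can be repeated for other dataset sizes. This is a rough sketch that counts only the packed genotypes (4 per byte) and ignores the per-SNP and per-individual overhead, so treat the result as a lower bound; the function names are illustrative.

```python
def genotype_bytes(n_individuals: int, n_snps: int) -> float:
    """Approximate bytes needed for the genotypes alone, at 4 genotypes per byte."""
    return n_individuals * n_snps / 4


def to_gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


if __name__ == "__main__":
    for n_ind in (5000, 20000):
        size = genotype_bytes(n_ind, 1_000_000)
        print(f"{n_ind} individuals x 1e6 SNPs: about {to_gb(size):.2f} GB")
```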

For datasets with very many SNPs, even the list of SNP names and storage information can take a considerable amount of space, even if the number of individuals is small (e.g. for the Phase 2 HapMap data, most of the space is taken up by the SNP name and position information, rather than by the genotypes themselves).

You can test the capacity of PLINK and your machine by entering the command
plink --dummy 15000 500000 --make-bed --out test1

to generate a dummy file of, in this instance, 15,000 individuals genotyped on 500,000 SNPs. If you do not get an Out of memory error, then it has worked. Note that dealing with files this size will take a while. Of course, in many cases it would be easy to split up the data and do per-chromosome analyses if need be, which would help on smaller machines.
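
A per-chromosome split can be scripted. The sketch below only builds the command lines (it does not run them), using the --chr option to select one chromosome at a time. The fileset name mydata, the output prefix and the restriction to autosomes 1-22 are assumptions made for illustration.

```python
import shlex


def per_chromosome_commands(bfile, out_prefix, chromosomes=range(1, 23)):
    """Build one PLINK command per chromosome; nothing is executed here."""
    return [
        ["plink", "--bfile", bfile, "--chr", str(c),
         "--make-bed", "--out", f"{out_prefix}_chr{c}"]
        for c in chromosomes
    ]


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell script.
    for cmd in per_chromosome_commands("mydata", "mydata"):
        print(shlex.join(cmd))
```

Each resulting fileset can then be loaded and analysed separately, which keeps the memory needed for any single run much smaller than for the whole dataset.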

Why does my linear/logistic regression output have all NAs?

PLINK will set the output to be all NAs if it was unable to fit the regression model. Common causes for this are:
  • There is no variation in the phenotype or in one or more of the predictor variables: are you sure the right variables were selected, and that no filters were applied that leave only cases, for example? Is the SNP monomorphic?
  • The correlation between predictor variables is too strong. PLINK uses the variance inflation factor (VIF) criterion to check for multi-collinearity. If two or more variables perfectly predict each other, PLINK will (correctly) print all NAs to the output, indicating that the model cannot be fit. Sometimes, however, PLINK may be overly conservative in calling such problems, which is particularly likely to occur if you add more covariates and allow for interactions between terms (as the interaction terms will correlate with the main effect variables). The default VIF threshold is 10; try setting this value higher with the --vif option, to say 100. The VIF is 1/(1-R^2), where R is the multiple correlation coefficient between one predictor variable and all others. A value of 100 implies R^2=0.99. If one or more variables fail the VIF test, then the entire model is not run and NAs appear in the output.
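
To get a feel for what a given VIF threshold means, it can be converted to the implied squared multiple correlation. This is a minimal sketch, assuming the standard definition VIF = 1/(1 - R^2); the function names are illustrative and are not PLINK internals.

```python
def vif_from_r2(r2: float) -> float:
    """Variance inflation factor for a predictor whose squared multiple
    correlation with the other predictors is r2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1); r2 = 1 means perfect collinearity")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif: float) -> float:
    """Squared multiple correlation implied by a given VIF (vif >= 1)."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


if __name__ == "__main__":
    for threshold in (10, 100):
        print(f"VIF threshold {threshold}: R^2 above {r2_from_vif(threshold):.2f} fails")
```

Under this definition, the default threshold of 10 rejects a model once any predictor has R^2 above 0.9 with the others, while a threshold of 100 tolerates R^2 up to 0.99.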

What kind of computer do I need to run PLINK?

There are no special requirements: PLINK should be able to be compiled for any machine for which a recent C/C++ compiler is available. Pre-compiled binary versions are distributed from this website for Linux, MS-DOS and Mac machines.

In terms of speed, memory and diskspace, obviously more is usually better. The suggestions below are really minimum values to make life easy for a "normal" sized study (i.e. many analyses could easily be run on much smaller machines; some analyses will require more resources, etc).

The FAQ above about dataset limits gives some indication of the amount of RAM needed for large studies. Basically, for any whole-genome scale study you would want at least 2GB of RAM; 4GB or 8GB would be desirable.

In terms of disk space: the main storage requirements will result from the raw data (e.g. CEL files, etc) rather than genotype files or most PLINK results files. However, certain PLINK files can be large: e.g. .genome files for large samples, dosage output for whole-genome imputation of all HapMap SNPs, etc. Therefore, a large hard drive is desirable: not including storage for CEL files, a drive of at least 200GB would be good.

PLINK does not specifically take advantage of multi-core processors. For large datasets, a fast processor is desirable (e.g. at least 3GHz). The majority of analyses described in these pages can be performed on a single processor. For certain analyses (e.g. epistasis, permutation procedures on very large datasets, IBS calculation on very large datasets, etc), access to a parallel computing cluster, if possible, is very desirable and sometimes necessary.

In terms of operating systems, there should not be major differences in performance. A Linux/Unix environment probably has some advantages in terms of the text file processing utilities typically available and the more powerful shell scripting options, but personal preference and institutional support are probably bigger considerations. There is, however, a definite advantage to ensuring a C/C++ compiler exists on the system, so that the source code version of PLINK can be compiled for your particular system. This may give some performance advantages and allows access to the development source code (i.e. to receive a patched version that fixes a particular problem or adds a new feature before the next release is generally available).

Can I analyse multiple phenotypes in a single run (e.g. for gene expression datasets)?

For most association commands, you can specify the --all-pheno option to automatically loop over all phenotypes in an alternate phenotype file:
plink --bfile mydata --pheno phenos.raw --all-pheno --linear --covar covar.dat

If there are N phenotypes, this will generate N separate output files. If a header row was supplied in the alternate phenotype file, then each file will have the phenotype name appended (it is up to the user therefore to ensure that the phenotype names are unique). If not, the output files are simply numbered, P1, P2, etc, (e.g. plink.P1.assoc, etc).

This works for most basic association commands that consider all SNPs (e.g. --assoc, --logistic, --fisher, --cmh, etc) but currently not for any haplotype analysis or epistasis options.

How does PLINK handle the X chromosome in association tests?

By default, in the linear and logistic (--linear, --logistic) models, for alleles A and B, males are coded
     A   ->   0
     B   ->   1
and females are coded
     AA  ->   0
     AB  ->   1
     BB  ->   2
In addition, sex (0=male, 1=female) is automatically included as a covariate. It is therefore important never to include sex as a separate covariate in a covariate file, but rather to use the special --sex command, which tells PLINK to add sex as coded in the PED/FAM file as the covariate (in this way, it is not double-entered for X chromosome markers). If the sample is all female or all male, PLINK will know not to add sex as an additional covariate for X chromosome markers.
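
The coding scheme described above can be written out as a small function. This is only an illustration of the scheme, not PLINK's own code; the function names, the "male"/"female" labels and the string representation of genotypes are assumptions.

```python
def code_x_genotype(genotype: str, sex: str, effect_allele: str = "B") -> int:
    """Code an X chromosome genotype as a count of the effect allele:
    males carry one allele (0 or 1), females carry two (0, 1 or 2)."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    expected_len = 1 if sex == "male" else 2
    if len(genotype) != expected_len:
        raise ValueError(f"a {sex} X genotype should have {expected_len} allele(s)")
    return genotype.count(effect_allele)


def sex_covariate(sex: str) -> int:
    """Sex covariate as described in the text: 0 = male, 1 = female."""
    return {"male": 0, "female": 1}[sex]


if __name__ == "__main__":
    for genotype, sex in [("A", "male"), ("B", "male"),
                          ("AA", "female"), ("AB", "female"), ("BB", "female")]:
        print(genotype, sex, code_x_genotype(genotype, sex), sex_covariate(sex))
```

Note that under this coding a male carrying B is coded 1, the same value as a heterozygous female, which is one reason sex is included as a covariate for these markers.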

The basic association tests that are allelic (--assoc, --mh, etc) do not need any special changes for X chromosome markers: the above only applies to the linear and logistic models where the individual, not the allele, is the unit of analysis. Similarly, the TDT remains unchanged. For the --model test and Hardy-Weinberg calculations, male X chromosome genotypes are excluded.

Not all analyses currently handle X chromosome markers (for example, LD pruning, epistasis, IBS calculations), but support will be added in future.

Can/why can't gPLINK perform a particular PLINK command?

gPLINK is intended only as a lightweight interface to some of the basic PLINK commands. It is designed to provide an easy way to become familiar with PLINK and to perform certain very basic operations for users who are not yet familiar with command line interfaces. It is not the recommended mode for using PLINK for anything beyond the most basic analyses and there are no immediate plans to extend gPLINK any further to incorporate new commands that are added to PLINK.

When I include covariates with --linear or --logistic, what do the p-values mean?

If one or more covariates are included (by --covar) when using --linear or --logistic, PLINK performs a multiple regression analysis and reports the coefficients and p-values for each term (i.e. SNP, covariates, any interaction terms). The only term omitted from the report is the intercept.

The p-values for the covariates do not represent the test for the SNP-phenotype association after controlling for the covariate. That test is given in the first row (ADD). Rather, the p-value in a covariate row is the test of the covariate-phenotype association. These p-values might be extremely significant (e.g. if one covaries for smoking in an analysis of heart disease, etc), but this does not necessarily mean that the SNP has a highly significant effect. For example:
   CHR        SNP      BP  A1   TEST  NMISS      BETA     STAT           P 
     1  rs1234567  742429   G    ADD   1495  -0.03335  -0.1732      0.8625
     1  rs1234567  742429   G   COV1   1495    0.1143    9.748  8.321e-022
suggests that the covariate is highly correlated with the outcome (which will often be already known, presumably), but there is no evidence that the SNP is in any way correlated with phenotype. These correspond to the partial regression coefficient terms of a multiple regression
  Y ~ m + b1.ADD + b2.COV1 + e 
where p=0.8625 is the Wald test for b1, p=8e-22 is the Wald test for b2, the covariate-phenotype relationship. To repeat: it does not mean that the SNP-phenotype test has a p=8e-22 after controlling for COV1.
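
As a rough check on how each row's p-value relates to its own STAT value, the two-sided p-value can be approximated from the statistic alone. This sketch assumes STAT can be treated as approximately standard normal, which is reasonable for a sample of this size (NMISS = 1495) but is an assumption rather than a description of PLINK's exact calculation; in the far tail the approximation will not reproduce the reported value digit for digit.

```python
import math


def two_sided_p_normal(stat: float) -> float:
    """Two-sided p-value for a test statistic under a standard normal
    approximation: P(|Z| > |stat|) = erfc(|stat| / sqrt(2))."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


if __name__ == "__main__":
    # STAT values taken from the example output above.
    for term, stat in [("ADD", -0.1732), ("COV1", 9.748)]:
        print(f"{term}: STAT = {stat}, approximate p = {two_sided_p_normal(stat):.4g}")
```

Each p-value depends only on the statistic in its own row: the small STAT for ADD gives a p-value close to 0.86, while the large STAT for COV1 gives a vanishingly small one, and neither says anything about the other term.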
 

This document last modified Wednesday, 25-Jan-2017 11:39:26 EST