FAQ and Hints
This section contains a small but expanding set of answers to questions and hints.
Can I convert my binary PED fileset back into a standard PED/MAP
fileset?
Yes. Use the --recode option, for example:
plink --bfile mydata --recode --out mynewdata
You might also want to use the variant forms --recode12 and
--recodeAD, which are described in the data management section.
To speed up input of a large fileset
Using the binary file format greatly increases loading speed relative to
the PED/MAP format. In addition, if you know that you have already removed
all the individuals you want to exclude (with the per-individual
genotyping threshold option), then setting
--mind 1
will skip the step where per-individual genotyping rates are calculated, which can reduce
the time taken to load the file. Note, the command --all is equivalent to
specifying --mind 1 --geno 1 --maf 0 (i.e. do not apply any filters).
Why are no individuals included in the analysis?
A common cause is that all individuals are non-founders (e.g. a
sibling pair dataset): by default, PLINK only uses founders to
calculate allele frequencies. The
--nonfounders
option can force these individuals in.
An alternative is that none of the individuals have a valid sex
code -- in this case, they are all set to missing status, unless the
--allow-no-sex
option is given. You are strongly recommended to enter the correct
sex codes for all individuals, however, so that they can be treated
appropriately in any subsequent analyses involving the sex chromosomes.
Why are my results different from an analysis using program X?
This is difficult to answer without specific details. If you send me a
question along these lines and want an answer, please make it as specific
as possible. Ideally, include example data that replicates the problem or
illustrates the difference.
There is always the possibility that the difference is due to a bug in PLINK,
which I would want to track down and fix. Similarly, it could be due to
a bug in the other software. Perhaps more likely, the difference arises from
one of two general sources:
- The analytic routines themselves are slightly different. Are the results dramatically different?
Do not expect exact numerical agreement between similar analyses (e.g. even for a simple single-SNP test,
--assoc, --fisher and --logistic will give slightly different p-values, and this is to be expected).
So, is the difference really meaningful? Perhaps more importantly, are you
sure the other routine really is implementing a similar test, with similar assumptions, etc?
- A common reason for apparent differences between PLINK and other
analysis packages is that PLINK applies some default filtering to the data,
i.e. it first removes individuals or SNPs whose genotyping rate falls below a threshold. Look at the LOG file to
check that exactly the same set of individuals and SNPs was included in both analyses.
In other words: be sure to check how missing data were handled in each case.
How large a file can PLINK handle?
There are no fixed limits on the size of the data file. PLINK currently uses 1 byte per
4 SNP genotypes, plus some overhead per SNP and per individual. This means that
you should be able to load datasets of, say, 1 million SNPs and up to 5000 individuals
on a machine with 2GB RAM without causing too much stress/swapping, etc. That is,
5000 * 1e6 / 4 = 1.25e9 bytes = ~1.25GB.
Things scale more or less linearly after that. So for a very large file 4
times the size (20K individuals, for example), an 8GB or 16GB machine would be
required to load the data in a single run.
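The arithmetic above can be sketched in a few lines of Python. This is only a rough lower bound under the stated assumption of 4 genotypes per byte; the function name is made up for illustration, and the per-SNP and per-individual overhead is deliberately ignored.

```python
def genotype_bytes(n_individuals: int, n_snps: int) -> float:
    """Approximate bytes needed for the genotypes alone, at 4 genotypes per byte.

    Hypothetical helper: ignores the per-SNP and per-individual overhead
    (names, positions, etc.), so real memory use will be higher.
    """
    return n_individuals * n_snps / 4


for n_ind in (5000, 20000):
    gb = genotype_bytes(n_ind, 1_000_000) / 1e9
    print(f"{n_ind} individuals x 1e6 SNPs: ~{gb:.2f} GB for genotypes alone")
```

Because the estimate is linear in both dimensions, quadrupling the number of individuals quadruples the figure, which is why the 20K-individual case calls for a machine with well over 5GB of RAM.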
For datasets with very many SNPs, even the list of SNP names and storage information
can take a substantial amount of space, even if the number of individuals is small (e.g.
for the Phase 2 HapMap data, most of the space is taken up with the SNP name and position
information, rather than the genotypes themselves).
You can test the capacity of PLINK and your machine by entering the
commands
plink --dummy 15000 500000 --make-bed --out test1
to generate a dummy file of, in this instance, 15,000 individuals genotyped on 500,000 SNPs.
If you do not get an Out of memory error, then it has worked. Note that dealing with files
this size will take a while. Of course, in many cases it would be easy to split up the data
and do per-chromosome analyses if need be, which would help on smaller machines.
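As one possible way to organise a per-chromosome split, the sketch below just builds the command lines as strings; it does not run PLINK. It assumes autosomes 1 to 22 and uses the --chr filter together with --make-bed; the function name and the output naming scheme (mydata_chr1, etc.) are invented for illustration.

```python
def per_chromosome_commands(bfile: str, chromosomes=range(1, 23)) -> list[str]:
    """Build one PLINK command line per chromosome (strings only, nothing is executed).

    Assumptions: `bfile` is the root name of an existing binary fileset, and
    the `<bfile>_chr<N>` output names are an arbitrary, hypothetical convention.
    """
    return [
        f"plink --bfile {bfile} --chr {c} --make-bed --out {bfile}_chr{c}"
        for c in chromosomes
    ]


cmds = per_chromosome_commands("mydata")
for cmd in cmds[:3]:
    print(cmd)
```

Each resulting fileset holds only a fraction of the SNPs, so it can be loaded and analysed separately on a smaller machine.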
Why does my linear/logistic regression output have all NA's?
PLINK will set the output to be all NAs if it was unable to fit the regression model. Common
causes for this are:
- There is no variation in the phenotype or in one or more of the predictor variables: are you sure the right
variables were selected, and that no filters were applied that leave only cases, for example?
Is the SNP monomorphic?
- The correlation between predictor variables is too strong. PLINK uses the variance
inflation factor (VIF) criterion to check for multi-collinearity. If two or more variables perfectly predict each
other, PLINK will (correctly) print all NAs to the output, indicating that the model cannot be fit. Sometimes,
however, PLINK may be overly conservative in calling such problems, which is particularly likely to occur if you
add more covariates and allow for interactions between terms (as the interaction terms will correlate with the
main effect variables). The default VIF threshold is 10; try setting this value higher with the --vif option, to say
100. The VIF is 1/(1-R^2), where R is the multiple correlation coefficient between one predictor variable and all others.
A threshold of 100 corresponds to R^2=0.99. If one or more variables fail the VIF test, then the entire model is not run and
NAs appear in the output.
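To get a feel for what a given VIF threshold means, the following sketch converts between a threshold and the squared multiple correlation. It assumes the standard textbook definition VIF = 1/(1 - R^2); the function names are illustrative and this is not PLINK's own code.

```python
import math


def vif_from_r2(r_squared: float) -> float:
    """Variance inflation factor for a predictor, given its squared multiple
    correlation with all other predictors (standard definition, assumed here)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r2_at_vif(vif: float) -> float:
    """Squared multiple correlation at which a predictor reaches the given VIF."""
    if vif < 1.0:
        raise ValueError("VIF cannot be below 1")
    return 1.0 - 1.0 / vif


for threshold in (10, 100):
    print(f"VIF threshold {threshold}: reached at R^2 = {r2_at_vif(threshold):.2f}")
```

Under this definition a threshold of 10 is reached once another combination of predictors explains 90% of a predictor's variance, and a threshold of 100 once it explains 99%.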
What kind of computer do I need to run PLINK?
There are no special requirements: PLINK should be able to be compiled
for any machine for which a recent C/C++ compiler is
available. Pre-compiled binary versions are distributed from this
website for Linux, MS-DOS and Mac machines.
In terms of speed, memory and diskspace, obviously more is usually
better. The suggestions below are really minimum values to make life
easy for a "normal" sized study (i.e. many analyses could easily be
run on much smaller machines; some analyses will require more
resources, etc).
The FAQ above about dataset limits gives some
indication of the amount of RAM needed for large studies. Basically,
for any whole-genome-scale study you would want at least 2GB of RAM;
4 or 8GB would be desirable.
In terms of disk space: the main storage requirements will result from
the raw data (e.g. CEL files, etc) rather than genotype files or most
PLINK results files. However, certain PLINK files can be large: e.g.
.genome files for large samples, dosage output for
whole-genome imputation of all HapMap SNPs, etc. Therefore, a large
hard drive is desirable: not including storage for CEL files, a drive
of at least 200GB would be good.
PLINK does not specifically take advantage of multi-core
processors. For large datasets, a fast processor is desirable (e.g. at
least 3GHz). The majority of analyses described in these pages can be performed on
a single processor. For certain analyses (e.g. epistasis, permutation
procedures on very large datasets, IBS calculation on very
large datasets, etc), access to a parallel computing cluster, if
possible, is very desirable and sometimes necessary.
In terms of operating systems, there should not be major differences
in performance: using a Linux/Unix environment probably has some
advantages in terms of the existing text file processing utilities
typically available, and the more powerful shell scripting options,
but probably personal preference and institutional support is a bigger
consideration. There is a definite advantage, however, to ensuring that a C/C++
compiler exists on the system so that the source code version of PLINK
can be compiled for your particular system: this may give
some performance advantages and allows access to
the development source code (i.e. to
receive a patched version that fixes a particular problem or adds a
new feature before the next release is generally available).
Can I analyse multiple phenotypes in a single run (e.g. for gene expression datasets)?
For most association commands, you can specify the --all-pheno option to automatically loop over
all phenotypes in an alternate phenotype file:
plink --bfile mydata --pheno phenos.raw --all-pheno --linear --covar covar.dat
If there are N phenotypes, this will generate N separate output files. If
a header row was supplied in the alternate phenotype file, then each file will have the
phenotype name appended (it is up to the user therefore to ensure that the phenotype names
are unique). If not, the output files are simply numbered P1, P2, etc. (e.g.
plink.P1.assoc).
This works for most basic association commands that consider all SNPs (e.g. --assoc,
--logistic, --fisher, --cmh, etc) but currently not for any
haplotype analysis or epistasis options.
How does PLINK handle the X chromosome in association tests?
By default, in the linear and logistic
(--linear, --logistic) models, for
alleles A and B, males are coded
A -> 0
B -> 1
and females are coded
AA -> 0
AB -> 1
BB -> 2
and additionally sex (0=male,1=female) is automatically included as a covariate. It is therefore important never
to include sex as a separate covariate in a covariate file, but rather to use the special --sex command,
which tells PLINK to add sex as coded in the PED/FAM file as the covariate (in this way, it is not entered twice for
X chromosome markers). If the sample is all female or all male, PLINK will know not to add sex as an additional covariate
for X chromosome markers.
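The coding scheme above can be written out as a small function. This is an illustrative sketch rather than PLINK's internal code: the function name, the string labels for sex, and the representation of a male genotype as a single allele character are all assumptions made for the example.

```python
MALE, FEMALE = "male", "female"


def code_x_genotype(sex: str, genotype: str, counted_allele: str = "B") -> int:
    """Return the additive code for an X chromosome genotype.

    Males carry one allele ("A" or "B") and are coded 0/1; females carry
    two ("AA", "AB", "BB") and are coded 0/1/2. Hypothetical helper.
    """
    if sex not in (MALE, FEMALE):
        raise ValueError(f"unknown sex: {sex!r}")
    expected_alleles = 1 if sex == MALE else 2
    if len(genotype) != expected_alleles:
        raise ValueError(f"{sex} genotype must have {expected_alleles} allele(s)")
    # The code is simply the number of copies of the counted allele.
    return genotype.count(counted_allele)


for sex, genotype in [(MALE, "A"), (MALE, "B"), (FEMALE, "AA"), (FEMALE, "AB"), (FEMALE, "BB")]:
    print(sex, genotype, "->", code_x_genotype(sex, genotype))
```

Note that under this coding a male carrying B and a heterozygous female both receive the value 1, which is one reason sex enters the model as a covariate for X chromosome markers.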
The basic association tests that are allelic
(--assoc, --mh, etc) do not need any special changes
for X chromosome markers: the above only applies to the linear and
logistic models where the individual, not the allele, is the unit of
analysis. Similarly, the TDT remains unchanged. For
the --model test and Hardy-Weinberg calculations, male X
chromosome genotypes are excluded.
Not all analyses currently handle X chromosome markers (for example, LD pruning, epistasis, IBS calculations), but support
will be added in future.
Can/why can't gPLINK perform a particular PLINK command?
gPLINK is intended only as a lightweight interface to some of
the basic PLINK commands. It is designed to provide an easy way to
become familiar with PLINK and to perform certain very basic
operations for users who are not yet familiar with command line
interfaces. It is not the recommended mode for using PLINK for
anything beyond the most basic analyses and there are no immediate
plans to extend gPLINK any further to incorporate new commands that
are added to PLINK.
When I include covariates with --linear or --logistic, what do the p-values mean?
If one or more covariates are included (by --covar) when
using --linear or --logistic, PLINK performs a
multiple regression analysis and reports the coefficients and p-values
for each term (i.e. SNP, covariates, any interaction terms). The only
term omitted from the report is the intercept.
The p-values for the covariates do not represent the test for
the SNP-phenotype association after controlling for the
covariate. That test is reported in the first row (ADD). Rather, the covariate
row reports the test of the covariate-phenotype
association. These p-values might be extremely significant (e.g. if
one covaries for smoking in an analysis of heart disease, etc) but this
does not necessarily mean that the SNP has a highly significant effect. For example:
 CHR         SNP      BP  A1  TEST  NMISS      BETA     STAT           P
   1   rs1234567  742429   G   ADD   1495  -0.03335  -0.1732      0.8625
   1   rs1234567  742429   G  COV1   1495    0.1143    9.748  8.321e-022
suggests that the covariate is highly correlated with the outcome
(which will often be already known, presumably), but there is no
evidence that the SNP is in any way correlated with phenotype. These
correspond to the partial regression coefficient terms of a multiple
regression:
Y ~ m + b1.ADD + b2.COV1 + e
where p=0.8625 is the Wald test for b1, p=8e-22 is the Wald
test for b2, the covariate-phenotype relationship. To repeat:
it does not mean that the SNP-phenotype test has a p=8e-22 after
controlling for COV1.
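When post-processing such output, it is therefore the ADD rows that should be extracted for the SNP-phenotype test. The sketch below shows one way this might be done; it parses the two example rows above, embedded as a string, and assumes whitespace-delimited columns with a header row. The function name is hypothetical, and rows whose p-value is NA are skipped.

```python
EXAMPLE_OUTPUT = """\
 CHR         SNP      BP  A1  TEST  NMISS      BETA     STAT           P
   1   rs1234567  742429   G   ADD   1495  -0.03335  -0.1732      0.8625
   1   rs1234567  742429   G  COV1   1495    0.1143    9.748  8.321e-022
"""


def snp_p_values(text: str) -> dict[str, float]:
    """Map each SNP name to the p-value on its ADD row, ignoring covariate rows."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    p_values = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Keep only the SNP term; COV1 etc. test the covariate, not the SNP.
        if row["TEST"] == "ADD" and row["P"] != "NA":
            p_values[row["SNP"]] = float(row["P"])
    return p_values


print(snp_p_values(EXAMPLE_OUTPUT))
```

For the example rows this keeps only p=0.8625 for rs1234567 and discards the covariate's p-value, matching the interpretation given above.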