1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Familybased association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Metaanalysis
21. Annotation
22. LDbased results clumping
23. Genebased report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. Rplugins
28. Annotation weblookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flowchart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK


Association analysis
The basic association test is for a disease trait and is based on
comparing allele frequencies between cases and controls (asymptotic
and empirical pvalues are available). Also implemented are the
CochranArmitage trend test, Fisher's exact test, different genetic models
(dominant, recessive and general), tests for stratified samples (e.g.
CochranMantelHaenszel, BreslowDay tests), a test for a quantitative
trait; a test for differences in missing genotype rate between cases and
controls; multilocus tests, using either Hotelling's T(2) statistic or
a sumstatistic approach (evaluated by permutation) as well as haplotype tests. The basic
tests can be performed with permutation, described in the
following section to provide empirical
pvalues, and allow for different designs (e.g. by use of structured,
withincluster permutation). Familybased tests are described in
the next section
HINT The basic association commands (assoc,
model, fisher, linear and logistic) will
test only a single phenotype. If your alternate phenotype file contains more than one
phenotype, then adding the allpheno flag will make PLINK cycle over each
phenotype, e.g. instead of a single plink.assoc output file, if there are
100 phenotypes, PLINK will now show
plink.P1.assoc
plink.P2.assoc
...
plink.P100.assoc
Naturally, it will take 100 times longer... If you are testing a very large number
of phenotypes, it might be worth specifying pfilter also, to reduce the
amount of amount (e.g. only outputing tests significant at p=1e4 if pfilter
1e4 is specified).
Basic case/control association test
To perform a standard case/control association analysis, use the option:
plink file mydata assoc
which generates a file
plink.assoc
which contains the fields:
CHR Chromosome
SNP SNP ID
BP Physical position (basepair)
A1 Minor allele name (based on whole sample)
F_A Frequency of this allele in cases
F_U Frequency of this allele in controls
A2 Major allele name
CHISQ Basic allelic test chisquare (1df)
P Asymptotic pvalue for this test
OR Estimated odds ratio (for A1, i.e. A2 is reference)
Hint In addition, if the optional command
ci X (where X is the desired coverage for a
confidence interval, e.g. 0.95 or 0.99) is included, then two extra
fields are appended to this output:
L95 Lower bound of 95% confidence interval for odds ratio
U95 Upper bound of 95% confidence interval for odds ratio
(where 95 would change if a different value was used with the
ci option, naturally).
Adding the option
counts
with assoc will make PLINK report allele counts, rather than frequencies, in cases and controls.
See the next section on permutation to learn how
to generate empirical pvalues and use other aspects of permutationbased
testing.
See the section on multimarker tests to learn how to perform haplotypebased
tests of association.
This analysis should appropriately handle X/Y chromosome SNPs automatically.
Fisher's Exact test (allelic association)
To perform a standard case/control association analysis using
Fisher's exact test to generate significance, use the option:
plink file mydata fisher
which generates a file
plink.fisher
which contains the fields:
CHR Chromosome
SNP SNP ID
BP Physical position (basepair)
A1 Minor allele name (based on whole sample)
F_A Frequency of this allele in cases
F_U Frequency of this allele in controls
A2 Major allele name
P Exact pvalue for this test
OR Estimated odds ratio (for A1)
As described below, if fisher is specified with model as well,
PLINK will perform genotypic tests using Fisher's exact test.
Note You can also use permutation to generate exact,
empirical significance values that would also be valid in small samples,
etc.
Alternate / full model association tests
It is possible to perform tests of association between a disease and a variant other than the
basic allelic test (which compares frequencies of alleles in cases versus controls), by using
the model option. The tests offered here are (in addition to the basic allelic test):
 CochranArmitage trend test
 Genotypic (2 df) test
 Dominant gene action (1df) test
 Recessive gene action (1df) test
One advantage of the CochranArmitage test is that it does not assume HardyWeinberg equilibrium,
as the individual, not the allele, is the unit of analysis (although the permutationbased empirical
pvalues from the basic allelic test also have this property). It is important to remember that SNPs
showing severe deviations from HardyWeinberg are often likely to be bad
SNPs, or reflect stratification in the sample, however, and so are probably
best excluded in many cases.
The genotypic test provides a general test of association in the 2by3 table of diseasebygenotype. The
dominant and recessive models are tests for the minor allele (which is the
minor allele can be found in the output of either the assoc
or the freq commands. That is, if D is the minor
allele (and d is the major allele):
Allelic: D versus d
Dominant: (DD, Dd) versus dd
Recessive: DD versus (Dd, dd)
Genotypic: DD versus Dd versus dd
As mentioned above, these tests are generated with option:
plink file mydata model
which generates a file
plink.model
which contains the following fields:
CHR Chromosome number
SNP SNP identifier
TEST Type of test
AFF Genotypes/alleles in cases
UNAFF Genotypes/alleles in controls
CHISQ Chisquated statistic
DF Degrees of freedom for test
P Asymptotic pvalue
Each SNP will feature on five rows of the output, correspondnig to the
five tests applied. The column TEST refers to
either ALLELIC, TREND, GENO,
DOM or REC, refering to the different types of test
mentioned above. The genotypic or allelic counts are given for cases
and controls separately. For recessive and dominant tests, the counts
represent the genotypes, with two of the classes pooled.
These tests only consider diploid genotypes: that is, for the X
chromosome males will be excluded even from the ALLELIC test. This way
the same data are used for the five tests presented here. Note that,
in contrast, the basic association commands (assoc
and linear, etc) include single male X chromosomes, and so the
results may differ.
The genotypic and dominant/recessive tests will only be conducted if
there is a minimum number of observations per cell in the 2by3
table: by default, if at least one of the cells has a frequency less
than 5, then we skip the alternate tests (NA is written in
the results file). The CochranArmitage and allelic tests are
performed in all cases. This threshold can be altered with the
cell option:
plink file mydata model cell 20
If permutation (with the mperm or perm options)
is specified, the model option will by default perform a
permutation test based on the most significant result
of ALLELIC, DOM and REC models. That is,
for each SNP, the best original result will be compared against the
best of these three tests for that SNP for every replicate. In max(T)
permutation mode, this will also be compared against the best result
from all SNPs for the EMP2 field. This procedure controls for
the fact that we have selected the best out of three correlated tests
for each SNP. The output will be generated in the file
plink.model.best.perm
or
plink.model.best.mperm
depending on whether adaptive or max(T) permutation was used.
The behavior of the model command can be changed by adding
the modelgen, modeltrend, modeldom
or modelrec flags to make the permutation use the
genotypic, the CochramArmitage trend test, the dominant test or the
recessive test as the basis for permutation instead. In this case, one
of the the following files will be generated:
plink.model.gen.perm plink.model.gen.mperm
plink.model.trend.perm plink.model.trend.mperm
plink.model.dom.perm plink.model.dom.mperm
plink.model.rec.perm plink.model.rec.mperm
It is also possible to add the fisher flag to obtain exact pvalues:
./plink bfile mydata model fisher
in which case the CHISQ field does not appear. Note
that the genotypic, allelic, dominant and recessive models use the
Fisher's exact; the trendtest does not and will give the same pvalue as without
the fisher flag. Also, by default, when fisher is added, the
cell field is set to 0, i.e. to include all SNPs.
Stratified analyses
When a cluster variable has been specified, by pointing to a file
that contains this information, with the within command,
it is possible to perform a number of tests of case/control association
that take this clustering into account, or explicitly test for
homogeneity of effect between clusters.
Note In many cases, permutation procedures can also
be used to account for clusters in
the data. See the next section for more details.
The tests presented below
are only applicable for case/control data, so permutation might be useful
for quantitative trait outcomes,
etc.
There are two basic classes of test:
 Testing for overall disease/gene association, controlling for clusters
 Testing for heterogeneity of the disease/gene assocation between different clusters
The type of cluster structure will vary in terms of how many clusters
there are in the sample, and how many
people belong to each cluster. At one extreme, we might have two only 2
clusters in the sample, each with
a large number of cases and controls. At the other extreme, we might have
a very large number of clusters,
such that each cluster only has 2 individuals. These factors will
influence the choice of stratified
analysis.
The tests offered are:
 CochranMantelHaenszel test for 2x2xK stratified tables
 CochranMantelHaenszel test for IxJxK stratified tables
 BreslowDay test of homogeneity of odds ratio
 Partitioning the total association chisquare to perform between and within cluster
association, and a test of homogeneity of effect
The CochranMantelHaenszel (CMH) tests are valid with both a large
number of small clusters and a small
number of large clusters. These tests provide a test based on an
"average" odds ratio that controls for the potential confounding due to
the cluster variable.
The BreslowDay test asks whether different clusters have different
disease/gene odds ratios: this test
assumes a moderate sample size within each cluster. The partitioning
total association test, which is
conceptually similar to the BreslowDay test, also makes the same
assumption.
As mentioned above, the CMH test comes in two flavours: 2x2xK and
IxJxK. Currently, the 2x2xK test represents a { disease x SNP 
cluster } test. The generalized form, the IxJxK, represents a
test of { cluster x SNP  disease }, i.e. does the SNP vary
between clusters, controlling for any possible true SNP/disease
association. This latter test might be useful in interpreting
significant associations in stratified samples. Typically, the first
form of the test will be of more interest, however. These two tests
are run by using the options:
plink file mydata mh within mycluster.dat
for the basic CMH test, or
plink file mydata mh2 within mycluster.dat
for the IxJxK test.
The mh option generates the file
plink.cmh
which contains the fields
CHR Chromosome number
SNP SNP identifier
A1 Minor allele code
A2 Major allele code
BP Physical position (basepair)
CHISQ CochranMantelHaenszel statistic (1df)
P Asymptotic pvalue for CMH test
OR CMH odds ratio
L95 Lower bound on confidence interval for CMH odds ratio
U95 Upper bound on confidence interval for CMH odds ratio
The range of the confidence interval with the mh option can be
changed with the ci option:
plink file mydata mh within mycluster.dat ci 0.99
The mh2 option generates the file
plink.cmh2
which contains the fields:
CHR Chromosome
SNP SNP identifier
CHISQ_CMH2 CochranMantelHaenszel test for IxJxK tables
P_CMH2 Asymptotic pvalue for this test
It is not possible to obtain confidence intervals or
odds ratios for mh2 tests.
Hint A trick to analyse phenotypes with more two
categories (but only with nominal, not ordinal outcomes) is to use the
mh2 option with the
phenotype in the cluster file and the phenotype in the PED file set all
to a single value.
Testing for heterogeneous association
As mentioned in the previous section, two methods are provided to test
for betweencluster differences in association when using a case/control
design. The BreslowDay test is specified with the option:
plink file mydata bd within myclst.txt
which runs and generates the same files as the mh
option, described above, but with two extra fields appended:
CHISQ_BD BreslowDay test
P_BD Asymptotic pvalue
where a significant value indicates betweencluster heterogeneoty
in the odds ratios for the disease/SNP association.
A similar test of the homogeneity of odds ratio tests based on
partitioning the chisquare statistic is given by:
plink file mydata homog within myclst.txt
which generates the file
plink.homog
which contains the fields
CHR Chromosome number
SNP SNP identifier
A1 Minor allele code
A2 Major allele code
F_A Case allele frequency
F_U Control allele frequency
N_A Case allele count
N_U Control allele count
TEST Type of test
CHISQ Chisquared association statistic
DF Degrees of freedom
P Asymptotic pvalue
OR Odds ratio
The TEST type is either
TOTAL Total SNP & strata association
ASSOC SNP association controlling for strata
HOMOG Betweenstrata heterogeneity test
X_1 Association in first stratum
X_2 Association in second stratum
...
Hotelling's T(2) multilocus association test
IMPORTANT This command has been temporarily disabled
For diseasetraits, PLINK provides support for a
multilocus, genotypebased test using Hotelling's T2 (Tsquared)
statistic. The set option should be used to specify
which SNPs are to be grouped, as follows:
plink file data set mydata.set T2
where mydata.set defines which SNPs are in which set (see
this section for more information on
defining sets).
This command will generate a file
plink.T2
which contains the fields
SET Set name
SIZE Number of SNPs in this set
F Fstatistic from Hotelling's test
DF1 Degrees of freedom 1
DF2 Degrees of freedom 2
P_HOTEL Asymptotic pvalue
HINT Use the genedrop permutation to
perform a familybased application of the Hotelling's T2 test.
This command can be used with all permutation methods (labelswapping
or genedropping, adaptive or max(T)). In fact, the permutation test
is based on 1p in order to make the between set comparisons for the
max(T) statistic more meaningful (as different sized sets would have
Fstatistics with different degrees of freedom otherwise). Using
permutation will generate one of the following files:
plink.T2.perm
which contain the fields
SET Set name
SIZE Number of SNPs in this set
EMP1 Empirical pvalue
NR Number of permutation replicates
or, if mperm was used,
plink.T2.mperm
which contain the fields
SET Set name
SIZE Number of SNPs in this set
EMP1 Empirical pvalue
EMP2 max(T) empirical pvalue
Note that this test uses a simple approach to missing data: rather
than casewise deletion (removing an individual if they have at least
one missing observation) we impute the mean allelic value. Although
this retains power under most scenarios, it can also cause some bias
when there are lots of missing data points. Using permutation is a
good way around this issue.
Quantitative trait association
Quantitative traits can be tested for association also, using either
asymptotic (likelihood ratio test and Wald test) or empirical
significance values. If the phenotype (column 6 of the PED file or the
phenotype as specified with the pheno option) is
quantitative (i.e. contains values other than 1, 2, 0 or missing)
then PLINK will automatically treat the analysis as a
quantitative trait analysis. That is, the same command as for
diseasetrait association:
plink file mydata assoc
will generate the file
plink.qassoc
with fields as follows:
CHR Chromosome number
SNP SNP identifier
BP Physical position (basepair)
NMISS Number of nonmissing genotypes
BETA Regression coefficient
SE Standard error
R2 Regression rsquared
T Wald test (based on tdistribtion)
P Wald test asymptotic pvalue
If permutations were also requested, then an extra file, either
plink.assoc.perm
or
plink.assoc.mperm
will be generated, depending on whether adaptive or max(T) permutation
was used (see the next section for more
details). The empirical pvalues are based on the Wald statistic.
Genotype means for quantitative traits
Adding the flag qtmeans along with the assoc
command, when run with a quantitative trait, will produce an
additional file with a list of means and standard deviations
stratified by genotype, called
plink.qassoc.means
and format
CHR Chromosome code
SNP SNP identifier
VALUE Description of next three fields
G11 Value for first genotype
G12 Value for second genotype
G22 Value for third genotype
where VALUE is one of GENO, COUNTS, FREQ,
MEAN or SD (standard deviation). For example:
CHR SNP VALUE G11 G12 G22
5 hCV26311749 GENO 2/2 2/1 1/1
5 hCV26311749 COUNTS 1 60 597
5 hCV26311749 FREQ 0.00152 0.09119 0.9073
5 hCV26311749 MEAN 0.9367 0.4955 0.5074
5 hCV26311749 SD 0 0.273 0.2902
5 hCV918000 GENO 2/2 2/1 1/1
5 hCV918000 COUNTS 47 237 359
5 hCV918000 FREQ 0.07309 0.3686 0.5583
5 hCV918000 MEAN 0.505 0.5091 0.5074
5 hCV918000 SD 0.2867 0.3064 0.2797
i.e. each SNP takes up 5 rows.
Quantitative trait interaction (GxE)
PLINK provides the ability to test for a difference in
association with a quantitative trait between two environments (or, more
generally, two groups). This test is simply based on comparing the
difference between two regression coefficients. To perform this test:
plink file mydata gxe covar mycov.dat
where mycovar.txt is a file containing the following fields:
Family ID
Individual ID
Covariate value
See the notes on covariate files for more
details.
This option will generate the file
plink.qassoc.gxe
which contains the fields:
CHR Chromosome number
SNP SNP identifier
NMISS1 Number of nonmissing genotypes in first group (1)
BETA1 Regression coefficient in first group
SE1 Standard error of coefficient in first group
NMISS2 As above, second group
BETA2 As above, second group
SE2 As above, second group
Z_GXE Z score, test for interaction
P_GXE Asymptotic pvalue for this test
IMPORTANT! The covariate must be coded as an affection
status variable, i.e. 1 or 2 representing the first or second group.
Values of 0 or 9 can be used to indicate missing covariate values, in
which case that individual will be excluded from analysis.
Linear and logistic models
These two features allow for multiple covariates when testing for both
quantitative trait and disease trait SNP association, and for
interactions with those covariates. The covariates can either
be continuous or binary (i.e. for categorical covariates,
you must first make a set of binary dummy variables).
WARNING! These commands are in some ways more flexible than
the standard assoc command, but this comes with a price: namely,
these run more slowly...
In this section we consider:
 Basic uasge
 Covariate and interactions
 Flexibly specifying the precise model
 Flexibly specifying joint tests
Basic usage
For quantitative traits, use
plink bfile mydata linear
For disease traits, specify logistic regression with
plink bfile mydaya logistic
instead. All other commands in this section apply equally to both these models.
These commands will either generate the output file
plink.assoc.linear
or
plink.assoc.logistic
depending on the phenotype/command used. The basic format is:
CHR Chromosome
SNP SNP identifier
BP Physical position (basepair)
A1 Tested allele (minor allele by default)
TEST Code for the test (see below)
NMISS Number of nonmissing individuals included in analysis
BETA/OR Regression coefficient (linear) or odds ratio (logistic)
STAT Coefficient tstatistic
P Asymptotic pvalue for tstatistic
For the additive effects of SNPs, the direction of the regression
coefficient represents the effect of each extra minor allele
(i.e. a positive regression coefficient means that the minor allele
increases risk/phenotype mean). If the beta command is
added along with logistic, then the regression coefficients
rather than the odds ratios will be returned.
NOTE Elsewhere in this documentation, the term
reference allele is sometimes used to refer to A1,
i.e. the referenceallele command can be used to specify
which allele is A1. Note that in association testing, the odds ratios,
etc are typically calculated with A2 as the actual reference allele
(i.e. a positive OR means A1 increases risk relative to A2).
HINT Adding the ci 0.95, for example,
option will given 95% confidence intervals for the estimated
parameters, in additional L95 and U95 fields in the
output files.
By itself, the linear command will give identical results to the
Wald test from the assoc command when applied to quantitative
traits.The logistic command may give slightly different
results to the assoc command for disease traits, but this is
because a different test/model is being applied (i.e. logistic regression
rather than allele counting). The difference may be particularly large for
very rare alleles (i.e. if the SNP is monomorphic in cases or controls,
then the logistic regression model is not welldefined and asymptotic
results might not hold for the basic test either).
When using linear, adding the option
standardbeta
will first standard the phenotype (mean 0, unit variance), so the
resulting coefficients will be standardized.
The TEST column is by default ADD meaning the additive
effects of allele dosage. Adding the option
genotypic
will generate file which will have two extra tests per SNP, corresponding
to two extra rows: DOMDEV and GENO_2DF which represent a
separate test of the dominance component or a 2 df joint test of both
additive and dominance (i.e. corresponding the the general, genotypic
model in the model command). Unlike the dominance model is the
model, DOMDEV refers to a variable coded 0,1,0
for the three genotypes AA,Aa,aa, i.e. representing the
dominance deviation from additivity, rather specifying that a
particular allele is dominant or recessive. That is, the DOMDEV
term is fitted jointly with the ADD term in a single model.
NOTE! The coding PLINK uses with the 2
df genotypic model involves two variables representing an
additive effect and a dominance deviation;
A D
AA 0 0
AB 1 1
BB 2 0
Although the 2df test will be identical, you would not
expect to see similar pvalues, etc for the two individual terms if
instead you used a different version of "genotypic" coding, e.g. in
another analysis package, such as using dummy variables to represent
genotypes:
G1 G2
AA 0 0
AB 1 0
BB 0 1
That is, although fundamentally the same, in terms of the 2df test,
the interpretation of the two individual terms is different in these
two cases. To achieve this coding in PLINK (v1.02 onwards), add
the hethom flag as well as genotypic.
In a related note, you would not always expect the ADD
pvalue to be the same when entering in the dominance term as it is
without it; if in doubt, you are advised to stick to just interpreting
the 2 df test if using the genotypic option.
To specify a model assuming full dominance (or recessive) for the
minor allele (i.e. rather than the 2 df model mentioned above), you
can specify with either
dominant
or
recessive
Covariates and interactions
If a covariate file is also specified, then all covariates in that
file will be included in the regression model, labelled COV1, COV2, etc.
This is different to other commands which take only a single covariate (possibly
working in conjunction with the mcovar option).
NOTE The covarname or covarnumber commands can be used
to select a subset of all covariates in the file, described here.
For example, if the covariate file is made as
described here and contains 2
covariates then the command
plink bfile mydata linear genotypic covar mycov.txt
will add two extra tests per SNP, COV1 and COV2. The pvalue for the SNP term or terms
in the model will be adjusted for the covariates; that is, a single model is fit to the data (that also includes
a dominance term, as the genotypic flag was also set):
Y = b0 + b1.ADD + b2.DOMDEV + b3.COV1 + b4.COV2 + e
(Note, using this notation, the genotypic test is of b1=b2=0.)
The output per each SNP might look something like:
CHR SNP BP A1 TEST NMISS OR STAT P
5 rs000001 10001 A ADD 664 0.7806 1.942 0.05216
5 rs000001 10001 A DOMDEV 664 0.9395 0.3562 0.7217
5 rs000001 10001 A COV1 664 0.9723 0.7894 0.4299
5 rs000001 10001 A COV2 664 1.159 0.5132 0.6078
5 rs000001 10001 A GENO_2DF 664 NA 5.059 0.0797
That is, this represent coefficients from four terms in a multiple
regression of disease on ADD,
DOMDEV, COV1 and COV2 jointly. The final
test is a 2df test that tests the coefficients for ADD
and DOMDEV together. Importantly, the pvalues for each line
reflect the effect of the entity under the TEST column, not
of the SNP whilst controlling for that particular covariate. (That is, p=0.0797
is the 2df test of the SNP whilst controlling for COV1 and COV2.)
HINT To suppress the multiple lines of output for
each covariate (which often are not of interest in themselves) add the
flag hidecovar, i.e. the above would just read as follows
for this SNP:
CHR SNP BP A1 TEST NMISS OR STAT P
5 rs000001 10001 A ADD 664 0.7806 1.942 0.05216
5 rs000001 10001 A GENO_2DF 664 NA 5.059 0.0797
HINT To condition analysis on a specific SNP when using linear or logistic,
use the condition option, e.g.
plink bfile mydata linear condition rs123456
will test all SNPs but adding the allelic dosage for rs123456 as a covariate. This
command can be used in conjunction with covar and the other options listed here.
To condition on multiple SNPs, use, for example,
plink bfile mydata linear conditionlist snps.txt where snps.txt is a plain text file
contain a list of SNPs which are to be included as covariates. The output will now include terms that correspond
to the SNPs listed in the file snps.txt.
The conditioning SNPs are entered into the model simply as covariates,
using a simple 0, 1, 2 allele dosage coding. That is, for two conditioning SNPs, rs1001 and rs1002 say,
and also a standard covariate, the model would be
Y = b0 + b1.ADD + b2.rs1001 + b3.rs1002 + b4.COV1 + e
If the b1 coefficient for the test SNP is still significant after entering these covariates,
this would suggest that it does indeeed have an effect independent of rs1001, rs1002 and
the other covariate. (The other coefficients may still be highly significant, but these reflect the effects of
the conditioning SNPs and covariates, not the test SNP.)
If the sex flag is added, then sex will be entered as a
covariate in the model (coded 1 for male, 0 for female), e.g
plink bfile mydata logistic sex
If the option interaction is added, then terms will be entered
which correspond to SNP x covariate interactions (with DOMDEV
as well as ADD if genotypic is specified). In the
case of two covariates, without genotypic, for example, the
command
plink bfile mydata linear covar tmp.cov interaction
results in the model
Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e
NOTE Please remember that when interaction terms
are included in the model, the significance of the main effects can not
necessarily be interpreted straightforwardly (i.e. they will depend
on the arbitrary coding of the variables). In otherwords, when
including the interaction flag, you should probably only
interpret the interaction pvalue. Please refer to any standard text of
regression models if you are unclear on this.
Finally, a testall option drops all the terms in the model
in a multiple degree of freedom test.
Flexibly specifying the model
Use command such as covar and interaction will automatically
enter all covariates and possible SNP x covariate interactions. If one does not want to
test all of these, then use the parameters flag to extract only the ones of
interest.
For example, to take the example above:
Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e
If one only wanted ADD, the two covariates and the ADDxCOV2
but not the ADDxCOV1 interaction, then, from the above example,
you could use
plink bfile mydata linear covar tmp.cov interaction parameters 1,2,3,5
That is, parameters takes a commanseparated list of integers, starting from 1,
that represent the terms in the model (in the order in which they would appear if the
command were run without the parameters flag). In this case:
ADD [1]
COV1 [2]
COV2 [3]
ADD x COV1 [4] < excluded
ADD x COV2 [5]
Flexibly specifying joint tests
To perform a userdefined joint test of more than one parameter, use the
tests option. This takes a commadelimited set of parameter
numbers, for example: if the model is
ADD [1]
COV1 [2]
COV2 [3]
ADDxCOV1 [4]
ADDxCOV2 [5]
then
plink bfile mydate linear covar file.cov interaction tests
1,4,5
represeents a 3 degree of freedom test of ADD and the two
interactions.
Note, if this is used in conjunction with the parameters
option, then the coding here refers to the reduced model  for example,
the command
plink bfile mydate linear covar file.cov interaction
parameters 1,2,3,5 tests 1,4
performs a joint test of ADD and ADDxCOV2 (2df test)
whilst controlling for main effects of COV1 and COV2,
i.e. we do not use tests 1,5, as there are now only 4
terms in the model:
parameters 1,2,3,5 tests 1,4
ADD [1] [1] TEST
COV1 [2] [2]
COV2 [3] [3]
ADDxCOV1 [4] n/a
ADDxCOV2 [5] [4] TEST
In other words, we fit the model
Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV2 + e
and jointly test the hypothesis
H0: b1 = b4 = 0
As mentioned above, use testall to drop all terms in the model
in a single joint test.
Multicollinearity
A common problem with multiple regression is that of multicollinearity:
when the predictor variables are too strongly correlated to each other,
the parameter estimates will become unstable. PLINK tries to
detect this, and will display NA for the test statistic and
pvalue for all terms in the model if there is evidence of
multicollinearity. One common instance where this would occur would be
if one includes the genotypic option but a SNP only has two of
the three possible genotype classes: in this case, ADD and
DOM will be perfectly correlated and PLINK will
display NA for both tests; this is basically telling you that
you should rerun without the genotypic option for that
particular SNP. Similar principles apply to including covariates and
interactions terms: the more terms you include, the more likely you are
to have problems.
The vif option can be used to specify the variance
inflation factor (VIF) used in the initial test for
multicollinearity. The default value is 10  smaller values represent
more stringent tests.
HINT If you have a quantitative trait, only want an additive model
and have only a single binary covariate, use the gxe option (described above)
instead of linear: it will run much faster (being based on a more simple
test of the difference of two regression slopes; it will not necessarily give
numerically identical results to the multiple regression approach, but asymptotically both
tests should be similar).
Setbased tests
These setbased tests are particularly suited to largescale candidate
gene studies as opposed to whole genome association studies, as they
use permutaiton.
NOTE The basis of the setbased test has been changed in version 1.04 onwards.
This analysis works as follows:
 For each set, for each SNP determine which other SNPs are in LD,
above a certain threshold R
 Perform standard single SNP analysis (which might be basic
case/control association, familybased TDT or quantitative trait
analysis).
 For each set, select up to N "independent" SNPs (as
defined in step 1) with pvalues below P. The best SNP is
selected first; subsequent SNPs are selected in order of descreasing
statistical significance, after removing SNPs in LD with previously
selected SNPs.
 From these subsets of SNPs, the statistic for each set is calculated
as the mean of these single SNP statistics
 Permute the dataset a large number of times, keeping LD between
SNPs constant (i.e. permute phenotype labels)
 For each permuted dataset, repeat steps 2 to 4 above.
 Empirical pvalue for set (EMP1) is the number of times
the permuted setstatistic exceeds the original one for that set.
Note that the empirical pvalues are corrected for the multiple SNPs
within a set (taking account of the LD between these SNPs). They are
not corrected for multiple testing if there is more than one set,
however (i.e. there is no equivalent of EMP2 (see the page
on permutation).
The critical parameters described above, R, N
and P can all be altered by the user, as described below.
To perform a setbased test the critical keywords are
settest
set my.set
mperm 10000
which state that we are performing a setbased test, which setfile to
use and how many permutations to perform (this last command is
necessary). As mentioned above, the assoc command could be
replaced by tdt, or logistic, etc.
The set file my.set is in form
SET1
rs1234
rs28384
rs29334
END
SET2
rs4774
rs662662
rs77262
END
...
For example,
plink file mydata settest set my.set mperm 10000 assoc
would display in the LOG file the following critical parameters with
their default values
Performed LDbased set test, with parameters:
rsquared (setr2) = 0.5
pvalue (setp) = 0.05
max # SNPs (setmax) = 5
The output is written to a file with a .set.mperm extension, for example
plink.assoc.set.mperm
with the fields
SET Set name
NSNP Number of SNPs in set
NSIG Total number of SNPs below pvalue threshold
ISIG Number of significant SNPs also passing LDcriterion
STAT Average test statistic based on ISIG SNPs
EMP1 Empirical setbased pvalue
SNPS List of SNPs in the set
For example, here is output from a case/control dataset with SNPs for
five related genes (lines truncated)
SET NSNP NSIG ISIG STAT EMP1 SNPS
GABRB2 45 0 0 0 1 NA
GABRA6 6 4 3 5.199 0.09489 rs3811991rs2197414...
GABRA1 22 11 5 5.951 0.09459 rs4254937rs4260711...
GABRG2 24 0 0 0 1 NA
GABRP 17 2 1 7.64 0.0269 rs7736504
Here the first gene, GABRB2 has 45 SNPs, but none of these
are significant at p=0.05, and so the empirival pvalue is necessarily
1.00. The next gene has 6 SNPs, 4 of which are significant, but only 3
of which are independently significant based on an rsquared threshold
of 0.5. The STAT of 5.199 is the average chisquared
statistic across these three SNPs. It should not be interpreted in
itself  rather, you should consider the EMP1 significance
value based on it. In this case, P=0.095. The final
gene, GABRP is nominally significant here, P=0.027, but this
does not correct for the 5 genes tested of course.
Naturally, different thresholds will produce different
results. Depending on the unknown genetic architecture, these may vary
substantially and meaningfully so. In general, if the set represents a
very large pathway (dozens of genes) you might want to
increase setmax. There are probably no hard and fast rules
with regard to how to set setp and setr2,
except to say that running under a large number of settings and
selecting the most significant is not a good idea.
Running with a "stricter" set of values
setr2 0.1
setp 0.01
setmax 2
we see a broadly similar pattern of results; naturally, the
thresholding on pvalue means that GABRA6 goes from showing
some signal to asbolutely no signal.
SET NSNP NSIG ISIG STAT EMP1 SNPS
GABRB2 45 0 0 0 1 NA
GABRA6 6 0 0 0 1 NA
GABRA1 22 2 2 7.464 0.05949 rs4254937rs4260711
GABRG2 24 0 0 0 1 NA
GABRP 17 1 1 7.64 0.06309 rs7736504
Alternatively, a more inclusive setting might be something like
setr2 0.8
setp 1
setmax 10
which, in this particular case, happens to yield slightly stronger
signals for GABRA6 and GABRA1 but weaker
for GABRp (lines truncated)
SET NSNP NSIG ISIG STAT EMP1 SNPS
GABRB2 45 12 10 1.749 0.7162 hCV26311691...
GABRA6 6 6 6 3.998 0.0184 rs3811991...
GABRA1 22 13 10 5.277 0.0182 rs4254937...
GABRG2 24 11 10 0.6976 0.9099 hCV3167705...
GABRP 17 10 10 2.753 0.1225 rs7736504...
HINT Two extremes are to perform a test based on a) the best single SNP result per set:
setmax 1
setp 1
or to use all SNPs in a set:
setmax 99999
setp 1
setr2 1
Adjustment for multiple testing: Bonferroni, Sidak, FDR, etc
To generate a file of adjusted significance values that correct for
all tests performed and other metrics, use the option:
plink file mydata assoc adjust
which generates the file
plink.adjust
which contains the fields
CHR Chromosome number
SNP SNP identifer
UNADJ Unadjusted pvalue
GC Genomiccontrol corrected pvalues
BONF Bonferroni singlestep adjusted pvalues
HOLM Holm (1979) stepdown adjusted pvalues
SIDAK_SS Sidak singlestep adjusted pvalues
SIDAK_SD Sidak stepdown adjusted pvalues
FDR_BH Benjamini & Hochberg (1995) stepup FDR control
FDR_BY Benjamini & Yekutieli (2001) stepup FDR control
This file is sorted by significance value rather than genomic location, the
most significant results being at the top.
WARNING Currently, these procedures are only
implemented for asymptotic significance values for the standard TDT
and association (disease trait and quantitative
trait, assoc, linear, logistic) tests
and the 2x2xK CochranMantelHaenszel test. Future versions will
allow these results for empirical significance values and for other
tests (e.g. epistasis, etc).

