1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Meta-analysis
21. Annotation
22. LD-based results clumping
23. Gene-based report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. R-plugins
28. Annotation web-lookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flow-chart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK
|
|
Association analysis
The basic association test is for a disease trait and is based on
comparing allele frequencies between cases and controls (asymptotic
and empirical p-values are available). Also implemented are the
Cochran-Armitage trend test, Fisher's exact test, different genetic models
(dominant, recessive and general), tests for stratified samples (e.g.
Cochran-Mantel-Haenszel, Breslow-Day tests), a test for a quantitative
trait; a test for differences in missing genotype rate between cases and
controls; multilocus tests, using either Hotelling's T(2) statistic or
a sum-statistic approach (evaluated by permutation) as well as haplotype tests. The basic
tests can be performed with permutation, described in the
following section to provide empirical
p-values, and allow for different designs (e.g. by use of structured,
within-cluster permutation). Family-based tests are described in
the next section
HINT The basic association commands (--assoc,
--model, --fisher, --linear and --logistic) will
test only a single phenotype. If your alternate phenotype file contains more than one
phenotype, then adding the --all-pheno flag will make PLINK cycle over each
phenotype, e.g. instead of a single plink.assoc output file, if there are
100 phenotypes, PLINK will now show
plink.P1.assoc
plink.P2.assoc
...
plink.P100.assoc
Naturally, it will take 100 times longer... If you are testing a very large number
of phenotypes, it might be worth specifying --pfilter also, to reduce the
amount of amount (e.g. only outputing tests significant at p=1e-4 if --pfilter
1e-4 is specified).
Basic case/control association test
To perform a standard case/control association analysis, use the option:
plink --file mydata --assoc
which generates a file
plink.assoc
which contains the fields:
CHR Chromosome
SNP SNP ID
BP Physical position (base-pair)
A1 Minor allele name (based on whole sample)
F_A Frequency of this allele in cases
F_U Frequency of this allele in controls
A2 Major allele name
CHISQ Basic allelic test chi-square (1df)
P Asymptotic p-value for this test
OR Estimated odds ratio (for A1, i.e. A2 is reference)
Hint In addition, if the optional command
--ci X (where X is the desired coverage for a
confidence interval, e.g. 0.95 or 0.99) is included, then two extra
fields are appended to this output:
L95 Lower bound of 95% confidence interval for odds ratio
U95 Upper bound of 95% confidence interval for odds ratio
(where 95 would change if a different value was used with the
--ci option, naturally).
Adding the option
--counts
with --assoc will make PLINK report allele counts, rather than frequencies, in cases and controls.
See the next section on permutation to learn how
to generate empirical p-values and use other aspects of permutation-based
testing.
See the section on multimarker tests to learn how to perform haplotype-based
tests of association.
This analysis should appropriately handle X/Y chromosome SNPs automatically.
Fisher's Exact test (allelic association)
To perform a standard case/control association analysis using
Fisher's exact test to generate significance, use the option:
plink --file mydata --fisher
which generates a file
plink.fisher
which contains the fields:
CHR Chromosome
SNP SNP ID
BP Physical position (base-pair)
A1 Minor allele name (based on whole sample)
F_A Frequency of this allele in cases
F_U Frequency of this allele in controls
A2 Major allele name
P Exact p-value for this test
OR Estimated odds ratio (for A1)
As described below, if --fisher is specified with --model as well,
PLINK will perform genotypic tests using Fisher's exact test.
Note You can also use permutation to generate exact,
empirical significance values that would also be valid in small samples,
etc.
Alternate / full model association tests
It is possible to perform tests of association between a disease and a variant other than the
basic allelic test (which compares frequencies of alleles in cases versus controls), by using
the --model option. The tests offered here are (in addition to the basic allelic test):
- Cochran-Armitage trend test
- Genotypic (2 df) test
- Dominant gene action (1df) test
- Recessive gene action (1df) test
One advantage of the Cochran-Armitage test is that it does not assume Hardy-Weinberg equilibrium,
as the individual, not the allele, is the unit of analysis (although the permutation-based empirical
p-values from the basic allelic test also have this property). It is important to remember that SNPs
showing severe deviations from Hardy-Weinberg are often likely to be bad
SNPs, or reflect stratification in the sample, however, and so are probably
best excluded in many cases.
The genotypic test provides a general test of association in the 2-by-3 table of disease-by-genotype. The
dominant and recessive models are tests for the minor allele (which is the
minor allele can be found in the output of either the --assoc
or the --freq commands. That is, if D is the minor
allele (and d is the major allele):
Allelic: D versus d
Dominant: (DD, Dd) versus dd
Recessive: DD versus (Dd, dd)
Genotypic: DD versus Dd versus dd
As mentioned above, these tests are generated with option:
plink --file mydata --model
which generates a file
plink.model
which contains the following fields:
CHR Chromosome number
SNP SNP identifier
TEST Type of test
AFF Genotypes/alleles in cases
UNAFF Genotypes/alleles in controls
CHISQ Chi-squated statistic
DF Degrees of freedom for test
P Asymptotic p-value
Each SNP will feature on five rows of the output, correspondnig to the
five tests applied. The column TEST refers to
either ALLELIC, TREND, GENO,
DOM or REC, refering to the different types of test
mentioned above. The genotypic or allelic counts are given for cases
and controls separately. For recessive and dominant tests, the counts
represent the genotypes, with two of the classes pooled.
These tests only consider diploid genotypes: that is, for the X
chromosome males will be excluded even from the ALLELIC test. This way
the same data are used for the five tests presented here. Note that,
in contrast, the basic association commands (--assoc
and --linear, etc) include single male X chromosomes, and so the
results may differ.
The genotypic and dominant/recessive tests will only be conducted if
there is a minimum number of observations per cell in the 2-by-3
table: by default, if at least one of the cells has a frequency less
than 5, then we skip the alternate tests (NA is written in
the results file). The Cochran-Armitage and allelic tests are
performed in all cases. This threshold can be altered with the
--cell option:
plink --file mydata --model --cell 20
If permutation (with the --mperm or --perm options)
is specified, the -model option will by default perform a
permutation test based on the most significant result
of ALLELIC, DOM and REC models. That is,
for each SNP, the best original result will be compared against the
best of these three tests for that SNP for every replicate. In max(T)
permutation mode, this will also be compared against the best result
from all SNPs for the EMP2 field. This procedure controls for
the fact that we have selected the best out of three correlated tests
for each SNP. The output will be generated in the file
plink.model.best.perm
or
plink.model.best.mperm
depending on whether adaptive or max(T) permutation was used.
The behavior of the --model command can be changed by adding
the --model-gen, --model-trend, --model-dom
or --model-rec flags to make the permutation use the
genotypic, the Cochram-Armitage trend test, the dominant test or the
recessive test as the basis for permutation instead. In this case, one
of the the following files will be generated:
plink.model.gen.perm plink.model.gen.mperm
plink.model.trend.perm plink.model.trend.mperm
plink.model.dom.perm plink.model.dom.mperm
plink.model.rec.perm plink.model.rec.mperm
It is also possible to add the --fisher flag to obtain exact p-values:
./plink --bfile mydata --model --fisher
in which case the CHISQ field does not appear. Note
that the genotypic, allelic, dominant and recessive models use the
Fisher's exact; the trend-test does not and will give the same p-value as without
the --fisher flag. Also, by default, when --fisher is added, the
--cell field is set to 0, i.e. to include all SNPs.
Stratified analyses
When a cluster variable has been specified, by pointing to a file
that contains this information, with the --within command,
it is possible to perform a number of tests of case/control association
that take this clustering into account, or explicitly test for
homogeneity of effect between clusters.
Note In many cases, permutation procedures can also
be used to account for clusters in
the data. See the next section for more details.
The tests presented below
are only applicable for case/control data, so permutation might be useful
for quantitative trait outcomes,
etc.
There are two basic classes of test:
- Testing for overall disease/gene association, controlling for clusters
- Testing for heterogeneity of the disease/gene assocation between different clusters
The type of cluster structure will vary in terms of how many clusters
there are in the sample, and how many
people belong to each cluster. At one extreme, we might have two only 2
clusters in the sample, each with
a large number of cases and controls. At the other extreme, we might have
a very large number of clusters,
such that each cluster only has 2 individuals. These factors will
influence the choice of stratified
analysis.
The tests offered are:
- Cochran-Mantel-Haenszel test for 2x2xK stratified tables
- Cochran-Mantel-Haenszel test for IxJxK stratified tables
- Breslow-Day test of homogeneity of odds ratio
- Partitioning the total association chi-square to perform between and within cluster
association, and a test of homogeneity of effect
The Cochran-Mantel-Haenszel (CMH) tests are valid with both a large
number of small clusters and a small
number of large clusters. These tests provide a test based on an
"average" odds ratio that controls for the potential confounding due to
the cluster variable.
The Breslow-Day test asks whether different clusters have different
disease/gene odds ratios: this test
assumes a moderate sample size within each cluster. The partitioning
total association test, which is
conceptually similar to the Breslow-Day test, also makes the same
assumption.
As mentioned above, the CMH test comes in two flavours: 2x2xK and
IxJxK. Currently, the 2x2xK test represents a { disease x SNP |
cluster } test. The generalized form, the IxJxK, represents a
test of { cluster x SNP | disease }, i.e. does the SNP vary
between clusters, controlling for any possible true SNP/disease
association. This latter test might be useful in interpreting
significant associations in stratified samples. Typically, the first
form of the test will be of more interest, however. These two tests
are run by using the options:
plink --file mydata --mh --within mycluster.dat
for the basic CMH test, or
plink --file mydata --mh2 --within mycluster.dat
for the IxJxK test.
The --mh option generates the file
plink.cmh
which contains the fields
CHR Chromosome number
SNP SNP identifier
A1 Minor allele code
A2 Major allele code
BP Physical position (base-pair)
CHISQ Cochran-Mantel-Haenszel statistic (1df)
P Asymptotic p-value for CMH test
OR CMH odds ratio
L95 Lower bound on confidence interval for CMH odds ratio
U95 Upper bound on confidence interval for CMH odds ratio
The range of the confidence interval with the --mh option can be
changed with the --ci option:
plink --file mydata --mh --within mycluster.dat --ci 0.99
The --mh2 option generates the file
plink.cmh2
which contains the fields:
CHR Chromosome
SNP SNP identifier
CHISQ_CMH2 Cochran-Mantel-Haenszel test for IxJxK tables
P_CMH2 Asymptotic p-value for this test
It is not possible to obtain confidence intervals or
odds ratios for --mh2 tests.
Hint A trick to analyse phenotypes with more two
categories (but only with nominal, not ordinal outcomes) is to use the
--mh2 option with the
phenotype in the cluster file and the phenotype in the PED file set all
to a single value.
Testing for heterogeneous association
As mentioned in the previous section, two methods are provided to test
for between-cluster differences in association when using a case/control
design. The Breslow-Day test is specified with the option:
plink --file mydata --bd --within myclst.txt
which runs and generates the same files as the --mh
option, described above, but with two extra fields appended:
CHISQ_BD Breslow-Day test
P_BD Asymptotic p-value
where a significant value indicates between-cluster heterogeneoty
in the odds ratios for the disease/SNP association.
A similar test of the homogeneity of odds ratio tests based on
partitioning the chi-square statistic is given by:
plink --file mydata --homog --within myclst.txt
which generates the file
plink.homog
which contains the fields
CHR Chromosome number
SNP SNP identifier
A1 Minor allele code
A2 Major allele code
F_A Case allele frequency
F_U Control allele frequency
N_A Case allele count
N_U Control allele count
TEST Type of test
CHISQ Chi-squared association statistic
DF Degrees of freedom
P Asymptotic p-value
OR Odds ratio
The TEST type is either
TOTAL Total SNP & strata association
ASSOC SNP association controlling for strata
HOMOG Between-strata heterogeneity test
X_1 Association in first stratum
X_2 Association in second stratum
...
Hotelling's T(2) multilocus association test
IMPORTANT This command has been temporarily disabled
For disease-traits, PLINK provides support for a
multilocus, genotype-based test using Hotelling's T2 (T-squared)
statistic. The --set option should be used to specify
which SNPs are to be grouped, as follows:
plink --file data --set mydata.set --T2
where mydata.set defines which SNPs are in which set (see
this section for more information on
defining sets).
This command will generate a file
plink.T2
which contains the fields
SET Set name
SIZE Number of SNPs in this set
F F-statistic from Hotelling's test
DF1 Degrees of freedom 1
DF2 Degrees of freedom 2
P_HOTEL Asymptotic p-value
HINT Use the --genedrop permutation to
perform a family-based application of the Hotelling's T2 test.
This command can be used with all permutation methods (label-swapping
or gene-dropping, adaptive or max(T)). In fact, the permutation test
is based on 1-p in order to make the between set comparisons for the
max(T) statistic more meaningful (as different sized sets would have
F-statistics with different degrees of freedom otherwise). Using
permutation will generate one of the following files:
plink.T2.perm
which contain the fields
SET Set name
SIZE Number of SNPs in this set
EMP1 Empirical p-value
NR Number of permutation replicates
or, if --mperm was used,
plink.T2.mperm
which contain the fields
SET Set name
SIZE Number of SNPs in this set
EMP1 Empirical p-value
EMP2 max(T) empirical p-value
Note that this test uses a simple approach to missing data: rather
than case-wise deletion (removing an individual if they have at least
one missing observation) we impute the mean allelic value. Although
this retains power under most scenarios, it can also cause some bias
when there are lots of missing data points. Using permutation is a
good way around this issue.
Quantitative trait association
Quantitative traits can be tested for association also, using either
asymptotic (likelihood ratio test and Wald test) or empirical
significance values. If the phenotype (column 6 of the PED file or the
phenotype as specified with the --pheno option) is
quantitative (i.e. contains values other than 1, 2, 0 or missing)
then PLINK will automatically treat the analysis as a
quantitative trait analysis. That is, the same command as for
disease-trait association:
plink --file mydata --assoc
will generate the file
plink.qassoc
with fields as follows:
CHR Chromosome number
SNP SNP identifier
BP Physical position (base-pair)
NMISS Number of non-missing genotypes
BETA Regression coefficient
SE Standard error
R2 Regression r-squared
T Wald test (based on t-distribtion)
P Wald test asymptotic p-value
If permutations were also requested, then an extra file, either
plink.assoc.perm
or
plink.assoc.mperm
will be generated, depending on whether adaptive or max(T) permutation
was used (see the next section for more
details). The empirical p-values are based on the Wald statistic.
Genotype means for quantitative traits
Adding the flag --qt-means along with the --assoc
command, when run with a quantitative trait, will produce an
additional file with a list of means and standard deviations
stratified by genotype, called
plink.qassoc.means
and format
CHR Chromosome code
SNP SNP identifier
VALUE Description of next three fields
G11 Value for first genotype
G12 Value for second genotype
G22 Value for third genotype
where VALUE is one of GENO, COUNTS, FREQ,
MEAN or SD (standard deviation). For example:
CHR SNP VALUE G11 G12 G22
5 hCV26311749 GENO 2/2 2/1 1/1
5 hCV26311749 COUNTS 1 60 597
5 hCV26311749 FREQ 0.00152 0.09119 0.9073
5 hCV26311749 MEAN 0.9367 0.4955 0.5074
5 hCV26311749 SD 0 0.273 0.2902
5 hCV918000 GENO 2/2 2/1 1/1
5 hCV918000 COUNTS 47 237 359
5 hCV918000 FREQ 0.07309 0.3686 0.5583
5 hCV918000 MEAN 0.505 0.5091 0.5074
5 hCV918000 SD 0.2867 0.3064 0.2797
i.e. each SNP takes up 5 rows.
Quantitative trait interaction (GxE)
PLINK provides the ability to test for a difference in
association with a quantitative trait between two environments (or, more
generally, two groups). This test is simply based on comparing the
difference between two regression coefficients. To perform this test:
plink --file mydata --gxe --covar mycov.dat
where mycovar.txt is a file containing the following fields:
Family ID
Individual ID
Covariate value
See the notes on covariate files for more
details.
This option will generate the file
plink.qassoc.gxe
which contains the fields:
CHR Chromosome number
SNP SNP identifier
NMISS1 Number of non-missing genotypes in first group (1)
BETA1 Regression coefficient in first group
SE1 Standard error of coefficient in first group
NMISS2 As above, second group
BETA2 As above, second group
SE2 As above, second group
Z_GXE Z score, test for interaction
P_GXE Asymptotic p-value for this test
IMPORTANT! The covariate must be coded as an affection
status variable, i.e. 1 or 2 representing the first or second group.
Values of 0 or -9 can be used to indicate missing covariate values, in
which case that individual will be excluded from analysis.
Linear and logistic models
These two features allow for multiple covariates when testing for both
quantitative trait and disease trait SNP association, and for
interactions with those covariates. The covariates can either
be continuous or binary (i.e. for categorical covariates,
you must first make a set of binary dummy variables).
WARNING! These commands are in some ways more flexible than
the standard --assoc command, but this comes with a price: namely,
these run more slowly...
In this section we consider:
- Basic uasge
- Covariate and interactions
- Flexibly specifying the precise model
- Flexibly specifying joint tests
Basic usage
For quantitative traits, use
plink --bfile mydata --linear
For disease traits, specify logistic regression with
plink --bfile mydaya --logistic
instead. All other commands in this section apply equally to both these models.
These commands will either generate the output file
plink.assoc.linear
or
plink.assoc.logistic
depending on the phenotype/command used. The basic format is:
CHR Chromosome
SNP SNP identifier
BP Physical position (base-pair)
A1 Tested allele (minor allele by default)
TEST Code for the test (see below)
NMISS Number of non-missing individuals included in analysis
BETA/OR Regression coefficient (--linear) or odds ratio (--logistic)
STAT Coefficient t-statistic
P Asymptotic p-value for t-statistic
For the additive effects of SNPs, the direction of the regression
coefficient represents the effect of each extra minor allele
(i.e. a positive regression coefficient means that the minor allele
increases risk/phenotype mean). If the --beta command is
added along with --logistic, then the regression coefficients
rather than the odds ratios will be returned.
NOTE Elsewhere in this documentation, the term
reference allele is sometimes used to refer to A1,
i.e. the --reference-allele command can be used to specify
which allele is A1. Note that in association testing, the odds ratios,
etc are typically calculated with A2 as the actual reference allele
(i.e. a positive OR means A1 increases risk relative to A2).
HINT Adding the --ci 0.95, for example,
option will given 95% confidence intervals for the estimated
parameters, in additional L95 and U95 fields in the
output files.
By itself, the --linear command will give identical results to the
Wald test from the --assoc command when applied to quantitative
traits.The --logistic command may give slightly different
results to the --assoc command for disease traits, but this is
because a different test/model is being applied (i.e. logistic regression
rather than allele counting). The difference may be particularly large for
very rare alleles (i.e. if the SNP is monomorphic in cases or controls,
then the logistic regression model is not well-defined and asymptotic
results might not hold for the basic test either).
When using --linear, adding the option
--standard-beta
will first standard the phenotype (mean 0, unit variance), so the
resulting coefficients will be standardized.
The TEST column is by default ADD meaning the additive
effects of allele dosage. Adding the option
--genotypic
will generate file which will have two extra tests per SNP, corresponding
to two extra rows: DOMDEV and GENO_2DF which represent a
separate test of the dominance component or a 2 df joint test of both
additive and dominance (i.e. corresponding the the general, genotypic
model in the --model command). Unlike the dominance model is the
--model, DOMDEV refers to a variable coded 0,1,0
for the three genotypes AA,Aa,aa, i.e. representing the
dominance deviation from additivity, rather specifying that a
particular allele is dominant or recessive. That is, the DOMDEV
term is fitted jointly with the ADD term in a single model.
NOTE! The coding PLINK uses with the 2
df --genotypic model involves two variables representing an
additive effect and a dominance deviation;
A D
AA 0 0
AB 1 1
BB 2 0
Although the 2df test will be identical, you would not
expect to see similar p-values, etc for the two individual terms if
instead you used a different version of "genotypic" coding, e.g. in
another analysis package, such as using dummy variables to represent
genotypes:
G1 G2
AA 0 0
AB 1 0
BB 0 1
That is, although fundamentally the same, in terms of the 2df test,
the interpretation of the two individual terms is different in these
two cases. To achieve this coding in PLINK (v1.02 onwards), add
the --hethom flag as well as --genotypic.
In a related note, you would not always expect the ADD
p-value to be the same when entering in the dominance term as it is
without it; if in doubt, you are advised to stick to just interpreting
the 2 df test if using the --genotypic option.
To specify a model assuming full dominance (or recessive) for the
minor allele (i.e. rather than the 2 df model mentioned above), you
can specify with either
--dominant
or
--recessive
Covariates and interactions
If a covariate file is also specified, then all covariates in that
file will be included in the regression model, labelled COV1, COV2, etc.
This is different to other commands which take only a single covariate (possibly
working in conjunction with the --mcovar option).
NOTE The --covar-name or --covar-number commands can be used
to select a subset of all covariates in the file, described here.
For example, if the covariate file is made as
described here and contains 2
covariates then the command
plink --bfile mydata --linear --genotypic --covar mycov.txt
will add two extra tests per SNP, COV1 and COV2. The p-value for the SNP term or terms
in the model will be adjusted for the covariates; that is, a single model is fit to the data (that also includes
a dominance term, as the --genotypic flag was also set):
Y = b0 + b1.ADD + b2.DOMDEV + b3.COV1 + b4.COV2 + e
(Note, using this notation, the genotypic test is of b1=b2=0.)
The output per each SNP might look something like:
CHR SNP BP A1 TEST NMISS OR STAT P
5 rs000001 10001 A ADD 664 0.7806 -1.942 0.05216
5 rs000001 10001 A DOMDEV 664 0.9395 -0.3562 0.7217
5 rs000001 10001 A COV1 664 0.9723 -0.7894 0.4299
5 rs000001 10001 A COV2 664 1.159 0.5132 0.6078
5 rs000001 10001 A GENO_2DF 664 NA 5.059 0.0797
That is, this represent coefficients from four terms in a multiple
regression of disease on ADD,
DOMDEV, COV1 and COV2 jointly. The final
test is a 2df test that tests the coefficients for ADD
and DOMDEV together. Importantly, the p-values for each line
reflect the effect of the entity under the TEST column, not
of the SNP whilst controlling for that particular covariate. (That is, p=0.0797
is the 2df test of the SNP whilst controlling for COV1 and COV2.)
HINT To suppress the multiple lines of output for
each covariate (which often are not of interest in themselves) add the
flag --hide-covar, i.e. the above would just read as follows
for this SNP:
CHR SNP BP A1 TEST NMISS OR STAT P
5 rs000001 10001 A ADD 664 0.7806 -1.942 0.05216
5 rs000001 10001 A GENO_2DF 664 NA 5.059 0.0797
HINT To condition analysis on a specific SNP when using --linear or --logistic,
use the --condition option, e.g.
plink --bfile mydata --linear --condition rs123456
will test all SNPs but adding the allelic dosage for rs123456 as a covariate. This
command can be used in conjunction with --covar and the other options listed here.
To condition on multiple SNPs, use, for example,
plink --bfile mydata --linear --condition-list snps.txt where snps.txt is a plain text file
contain a list of SNPs which are to be included as covariates. The output will now include terms that correspond
to the SNPs listed in the file snps.txt.
The conditioning SNPs are entered into the model simply as covariates,
using a simple 0, 1, 2 allele dosage coding. That is, for two conditioning SNPs, rs1001 and rs1002 say,
and also a standard covariate, the model would be
Y = b0 + b1.ADD + b2.rs1001 + b3.rs1002 + b4.COV1 + e
If the b1 coefficient for the test SNP is still significant after entering these covariates,
this would suggest that it does indeeed have an effect independent of rs1001, rs1002 and
the other covariate. (The other coefficients may still be highly significant, but these reflect the effects of
the conditioning SNPs and covariates, not the test SNP.)
If the --sex flag is added, then sex will be entered as a
covariate in the model (coded 1 for male, 0 for female), e.g
plink --bfile mydata --logistic --sex
If the option --interaction is added, then terms will be entered
which correspond to SNP x covariate interactions (with DOMDEV
as well as ADD if --genotypic is specified). In the
case of two covariates, without --genotypic, for example, the
command
plink --bfile mydata --linear --covar tmp.cov --interaction
results in the model
Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e
NOTE Please remember that when interaction terms
are included in the model, the significance of the main effects can not
necessarily be interpreted straightforwardly (i.e. they will depend
on the arbitrary coding of the variables). In otherwords, when
including the --interaction flag, you should probably only
interpret the interaction p-value. Please refer to any standard text of
regression models if you are unclear on this.
Finally, a --test-all option drops all the terms in the model
in a multiple degree of freedom test.
Flexibly specifying the model
Use command such as --covar and --interaction will automatically
enter all covariates and possible SNP x covariate interactions. If one does not want to
test all of these, then use the --parameters flag to extract only the ones of
interest.
For example, to take the example above:
Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e
If one only wanted ADD, the two covariates and the ADDxCOV2
but not the ADDxCOV1 interaction, then, from the above example,
you could use
plink --bfile mydata --linear --covar tmp.cov --interaction --parameters 1,2,3,5
That is, --parameters takes a comman-separated list of integers, starting from 1,
that represent the terms in the model (in the order in which they would appear if the
command were run without the --parameters flag). In this case:
ADD [1]
COV1 [2]
COV2 [3]
ADD x COV1 [4] <-- excluded
ADD x COV2 [5]
Flexibly specifying joint tests
To perform a user-defined joint test of more than one parameter, use the
--tests option. This takes a comma-delimited set of parameter
numbers, for example: if the model is
ADD [1]
COV1 [2]
COV2 [3]
ADDxCOV1 [4]
ADDxCOV2 [5]
then
plink --bfile mydate --linear --covar file.cov --interaction --tests
1,4,5
represeents a 3 degree of freedom test of ADD and the two
interactions.
Note, if this is used in conjunction with the --parameters
option, then the coding here refers to the reduced model -- for example,
the command
plink --bfile mydate --linear --covar file.cov --interaction
--parameters 1,2,3,5 --tests 1,4
performs a joint test of ADD and ADDxCOV2 (2df test)
whilst controlling for main effects of COV1 and COV2,
i.e. we do not use --tests 1,5, as there are now only 4
terms in the model:
--parameters 1,2,3,5 --tests 1,4
ADD [1] [1] TEST
COV1 [2] [2]
COV2 [3] [3]
ADDxCOV1 [4] n/a
ADDxCOV2 [5] [4] TEST
In other words, we fit the model
Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV2 + e
and jointly test the hypothesis
H0: b1 = b4 = 0
As mentioned above, use --test-all to drop all terms in the model
in a single joint test.
Multicollinearity
A common problem with multiple regression is that of multi-collinearity:
when the predictor variables are too strongly correlated to each other,
the parameter estimates will become unstable. PLINK tries to
detect this, and will display NA for the test statistic and
p-value for all terms in the model if there is evidence of
multi-collinearity. One common instance where this would occur would be
if one includes the --genotypic option but a SNP only has two of
the three possible genotype classes: in this case, ADD and
DOM will be perfectly correlated and PLINK will
display NA for both tests; this is basically telling you that
you should re-run without the --genotypic option for that
particular SNP. Similar principles apply to including covariates and
interactions terms: the more terms you include, the more likely you are
to have problems.
The --vif option can be used to specify the variance
inflation factor (VIF) used in the initial test for
multicollinearity. The default value is 10 -- smaller values represent
more stringent tests.
HINT If you have a quantitative trait, only want an additive model
and have only a single binary covariate, use the --gxe option (described above)
instead of --linear: it will run much faster (being based on a more simple
test of the difference of two regression slopes; it will not necessarily give
numerically identical results to the multiple regression approach, but asymptotically both
tests should be similar).
Set-based tests
These set-based tests are particularly suited to large-scale candidate
gene studies as opposed to whole genome association studies, as they
use permutaiton.
NOTE The basis of the set-based test has been changed in version 1.04 onwards.
This analysis works as follows:
- For each set, for each SNP determine which other SNPs are in LD,
above a certain threshold R
- Perform standard single SNP analysis (which might be basic
case/control association, family-based TDT or quantitative trait
analysis).
- For each set, select up to N "independent" SNPs (as
defined in step 1) with p-values below P. The best SNP is
selected first; subsequent SNPs are selected in order of descreasing
statistical significance, after removing SNPs in LD with previously
selected SNPs.
- From these subsets of SNPs, the statistic for each set is calculated
as the mean of these single SNP statistics
- Permute the dataset a large number of times, keeping LD between
SNPs constant (i.e. permute phenotype labels)
- For each permuted dataset, repeat steps 2 to 4 above.
- Empirical p-value for set (EMP1) is the number of times
the permuted set-statistic exceeds the original one for that set.
Note that the empirical p-values are corrected for the multiple SNPs
within a set (taking account of the LD between these SNPs). They are
not corrected for multiple testing if there is more than one set,
however (i.e. there is no equivalent of EMP2 (see the page
on permutation).
The critical parameters described above, R, N
and P can all be altered by the user, as described below.
To perform a set-based test the critical keywords are
--set-test
--set my.set
--mperm 10000
which state that we are performing a set-based test, which set-file to
use and how many permutations to perform (this last command is
necessary). As mentioned above, the --assoc command could be
replaced by --tdt, or --logistic, etc.
The set file my.set is in form
SET1
rs1234
rs28384
rs29334
END
SET2
rs4774
rs662662
rs77262
END
...
For example,
plink --file mydata --set-test --set my.set --mperm 10000 --assoc
would display in the LOG file the following critical parameters with
their default values
Performed LD-based set test, with parameters:
r-squared (--set-r2) = 0.5
p-value (--set-p) = 0.05
max # SNPs (--set-max) = 5
The output is written to a file with a .set.mperm extension, for example
plink.assoc.set.mperm
with the fields
SET Set name
NSNP Number of SNPs in set
NSIG Total number of SNPs below p-value threshold
ISIG Number of significant SNPs also passing LD-criterion
STAT Average test statistic based on ISIG SNPs
EMP1 Empirical set-based p-value
SNPS List of SNPs in the set
For example, here is output from a case/control dataset with SNPs for
five related genes (lines truncated)
SET NSNP NSIG ISIG STAT EMP1 SNPS
GABRB2 45 0 0 0 1 NA
GABRA6 6 4 3 5.199 0.09489 rs3811991|rs2197414|...
GABRA1 22 11 5 5.951 0.09459 rs4254937|rs4260711|...
GABRG2 24 0 0 0 1 NA
GABRP 17 2 1 7.64 0.0269 rs7736504
Here the first gene, GABRB2 has 45 SNPs, but none of these
are significant at p=0.05, and so the empirival p-value is necessarily
1.00. The next gene has 6 SNPs, 4 of which are significant, but only 3
of which are independently significant based on an r-squared threshold
of 0.5. The STAT of 5.199 is the average chi-squared
statistic across these three SNPs. It should not be interpreted in
itself -- rather, you should consider the EMP1 significance
value based on it. In this case, P=0.095. The final
gene, GABRP is nominally significant here, P=0.027, but this
does not correct for the 5 genes tested of course.
Naturally, different thresholds will produce different
results. Depending on the unknown genetic architecture, these may vary
substantially and meaningfully so. In general, if the set represents a
very large pathway (dozens of genes) you might want to
increase --set-max. There are probably no hard and fast rules
with regard to how to set --set-p and --set-r2,
except to say that running under a large number of settings and
selecting the most significant is not a good idea.
Running with a "stricter" set of values
--set-r2 0.1
--set-p 0.01
--set-max 2
we see a broadly similar pattern of results; naturally, the
thresholding on p-value means that GABRA6 goes from showing
some signal to asbolutely no signal.
SET NSNP NSIG ISIG STAT EMP1 SNPS
GABRB2 45 0 0 0 1 NA
GABRA6 6 0 0 0 1 NA
GABRA1 22 2 2 7.464 0.05949 rs4254937|rs4260711
GABRG2 24 0 0 0 1 NA
GABRP 17 1 1 7.64 0.06309 rs7736504
Alternatively, a more inclusive setting might be something like
--set-r2 0.8
--set-p 1
--set-max 10
which, in this particular case, happens to yield slightly stronger
signals for GABRA6 and GABRA1 but weaker
for GABRp (lines truncated)
SET NSNP NSIG ISIG STAT EMP1 SNPS
GABRB2 45 12 10 1.749 0.7162 hCV26311691|...
GABRA6 6 6 6 3.998 0.0184 rs3811991|...
GABRA1 22 13 10 5.277 0.0182 rs4254937|...
GABRG2 24 11 10 0.6976 0.9099 hCV3167705|...
GABRP 17 10 10 2.753 0.1225 rs7736504|...
HINT Two extremes are to perform a test based on a) the best single SNP result per set:
--set-max 1
--set-p 1
or to use all SNPs in a set:
--set-max 99999
--set-p 1
--set-r2 1
Adjustment for multiple testing: Bonferroni, Sidak, FDR, etc
To generate a file of adjusted significance values that correct for
all tests performed and other metrics, use the option:
plink --file mydata --assoc --adjust
which generates the file
plink.adjust
which contains the fields
CHR Chromosome number
SNP SNP identifer
UNADJ Unadjusted p-value
GC Genomic-control corrected p-values
BONF Bonferroni single-step adjusted p-values
HOLM Holm (1979) step-down adjusted p-values
SIDAK_SS Sidak single-step adjusted p-values
SIDAK_SD Sidak step-down adjusted p-values
FDR_BH Benjamini & Hochberg (1995) step-up FDR control
FDR_BY Benjamini & Yekutieli (2001) step-up FDR control
This file is sorted by significance value rather than genomic location, the
most significant results being at the top.
WARNING Currently, these procedures are only
implemented for asymptotic significance values for the standard TDT
and association (disease trait and quantitative
trait, --assoc, --linear, --logistic) tests
and the 2x2xK Cochran-Mantel-Haenszel test. Future versions will
allow these results for empirical significance values and for other
tests (e.g. epistasis, etc).
|
|