Family-based association analysis
PLINK focuses mainly on population-based samples. It does, however,
offer some support for family-based analyses of disease traits and
quantitative traits, as described in this section.
Family-based association (TDT)
PLINK supports basic family-based association testing for disease
traits, using the TDT and a variant of this test that also incorporates
parental phenotype information, the parenTDT.
To run a basic TDT analysis for family data:
plink --file mydata --tdt
which generates the file
plink.tdt
If permutation has been requested, then either
plink.tdt.perm
or
plink.tdt.mperm
will be generated also. The main output file, plink.tdt,
contains the following fields:
CHR         Chromosome number
SNP         SNP identifier
A1          Minor allele code
A2          Major allele code
T           Transmitted minor allele count
U           Untransmitted allele count
OR          TDT odds ratio
CHISQ       TDT chi-square statistic
P           TDT asymptotic p-value
A:U_PAR     Parental discordance counts
CHISQ_PAR   Parental discordance statistic
P_PAR       Parental discordance asymptotic p-value
CHISQ_COM   Combined test statistic
P_COM       Combined test asymptotic p-value
If the --ci option has been requested, then two additional
fields will appear after the OR field:
L95         Lower 95% confidence interval for TDT odds ratio
U95         Upper 95% confidence interval for TDT odds ratio
(naturally, if a value other than 0.95 was used as the argument for the
--ci option, it will appear here instead.)
The TDT statistic is calculated simply as
(b-c)^2 / (b+c)
where b and c are the number of transmitted and
untransmitted alleles as shown in plink.tdt; under the
null, it is distributed as a 1df chi-squared.
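As a rough illustration of the formula above, the following Python sketch computes the TDT chi-squared statistic and its asymptotic p-value from a pair of counts. The function names and the example counts are hypothetical and are not part of PLINK; the p-value uses the identity that, for 1 df, the upper tail of the chi-squared distribution equals erfc(sqrt(x/2)).

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value for a chi-squared statistic with 1 df."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def tdt(b, c):
    """Return (chisq, p) for b transmitted and c untransmitted alleles."""
    if b + c == 0:
        # No informative transmissions: the statistic is undefined.
        return float("nan"), float("nan")
    chisq = (b - c) ** 2 / (b + c)
    return chisq, chi2_1df_pvalue(chisq)


# Hypothetical counts, for illustration only.
chisq, p = tdt(60, 40)
print(f"CHISQ = {chisq:.3f}, P = {p:.4f}")
```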
The parental discordance test is based on counting the number of
alleles in affected versus unaffected parents, treating each
nuclear family parental pair as a matched pair. These counts
can be combined with the T and U counts of the
basic TDT to give a combined test statistic, also shown in the output.
The parenTDT assumes homogeneity within families rather than between
families, in terms of population stratification. If parents are measured
on the phenotype, then this test can add considerable power to
family-based association analysis, whilst providing a strong (but
not complete) degree of protection against population stratification. The increase
in power will depend on the proportion of parents that are discordant
for the disease. This approach is described in Purcell et al
AJHG (2005). PLINK uses a simpler approach to calculate the
PAR and COM statistics, however. If the parental pairs are counted
by genotype as
                  Unaffected parent
                  A/A   A/B   B/B
Affected   A/A     -     p     r
parent     A/B     x     -     q
           B/B     z     y     -
i.e. such that the A:U_PAR field represents p+q+2r : x+y+2z,
then
PAR = ( (p+q+2r) - (x+y+2z) )^2
/ ( p+q+x+y+4(r+z) )
and
COM = ( ( b+p+q+2r ) - ( c+x+y+2z ) )^2
/ ( b+p+q+c+x+y+4(r+z) )
Both statistics follow a 1 df chi-squared distribution under the null.
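The two formulas above can be checked by hand with a short Python sketch. The function names and the cell counts used in the example are hypothetical; the arithmetic follows the PAR and COM expressions as written, with b and c the transmitted and untransmitted counts from the basic TDT.

```python
def parental_terms(p, q, r, x, y, z):
    """Return the two A:U_PAR counts and the PAR denominator."""
    a_count = p + q + 2 * r          # left-hand side of A:U_PAR
    u_count = x + y + 2 * z          # right-hand side of A:U_PAR
    denom = p + q + x + y + 4 * (r + z)
    return a_count, u_count, denom


def par_statistic(p, q, r, x, y, z):
    """Parental discordance statistic (1 df chi-squared under the null)."""
    a_count, u_count, denom = parental_terms(p, q, r, x, y, z)
    return (a_count - u_count) ** 2 / denom


def com_statistic(b, c, p, q, r, x, y, z):
    """Combined TDT plus parental discordance statistic (1 df)."""
    a_count, u_count, denom = parental_terms(p, q, r, x, y, z)
    return ((b + a_count) - (c + u_count)) ** 2 / (b + c + denom)


# Hypothetical cell counts, for illustration only.
cells = dict(p=10, q=8, r=2, x=5, y=4, z=1)
print(par_statistic(**cells))
print(com_statistic(60, 40, **cells))
```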
When running the --tdt option, PLINK will first
perform a check for Mendel errors and make missing the offending
genotypes.
Using the --tdt option, if permutation is requested (using
either --perm or --mperm) a file entitled either
plink.tdt.perm
or
plink.tdt.mperm
will be generated: the empirical p-value will be based on the standard TDT
test. The permutation procedure will flip transmitted/untransmitted status
consistently across all SNPs for a given family, thereby preserving the LD and
linkage information between markers and siblings.
parenTDT
The parenTDT, described in the previous section, is
automatically included when using the --tdt option. The two
alternate commands below generate the same output as the --tdt
command, described above, except that the permutation is based not on the
standard TDT but on either the parenTDT, if using the option
plink --file mydata --parentdt1
or, the combined test (TDT and parenTDT) if using the option
plink --file mydata --parentdt2
Parent of origin analysis
When performing family-based TDT analysis, it is possible to
separately consider transmissions from heterozygous fathers
versus heterozygous mothers to affected offspring. This is performed
by adding the --poo option to request parent-of-origin analysis:
plink --file mydata --tdt --poo
which generates the file plink.tdt.poo. If permutation is also requested,
this also generates the file plink.tdt.poo.perm or plink.tdt.poo.mperm,
depending on which permutation procedure is used. The main output file has the following
format:
CHR         Chromosome number
SNP         SNP identifier
A1:A2       Allele 1 : allele 2 codes
T:U_PAT     Paternal transmitted : untransmitted counts
OR_PAT      Paternal odds ratio
CHISQ_PAT   Paternal chi-squared test
T:U_MAT     Maternal, as above
OR_MAT      Maternal, as above
CHISQ_MAT   Maternal, as above
Z_POO       Z score for difference in paternal versus maternal odds ratios
P_POO       Asymptotic p-value for parent-of-origin test
If permutation is requested, the default test statistic is the absolute value of the Z
score for the parent-of-origin test (i.e. making a two-sided test). The flags
--pat and --mat indicate that the permutation statistic should instead
be the paternal TDT chi-squared statistic or the maternal statistic, respectively.
NOTE When both parents are heterozygous, these ambiguous transmissions
are counted as 0.5 for both mother and father -- this is why the T:U counts will often
not be whole numbers.
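The text does not give the formula behind Z_POO. One conventional way to compare two odds ratios, shown here purely as an assumption and not as PLINK's documented method, is a Z score on the log odds ratio scale, with each odds ratio taken as T/U and its log variance as 1/T + 1/U. The function name and the counts below are hypothetical; the fractional counts reflect the 0.5 weighting described in the note above.

```python
import math


def poo_z(t_pat, u_pat, t_mat, u_mat):
    """Compare paternal and maternal odds ratios on the log scale.

    ASSUMPTION: OR = T / U and var(log OR) = 1/T + 1/U for each parent;
    this is a generic approximation, not a confirmed PLINK formula.
    """
    or_pat = t_pat / u_pat
    or_mat = t_mat / u_mat
    se = math.sqrt(1 / t_pat + 1 / u_pat + 1 / t_mat + 1 / u_mat)
    z = (math.log(or_pat) - math.log(or_mat)) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return or_pat, or_mat, z, p


# Hypothetical, possibly fractional, T:U counts.
print(poo_z(30.5, 20.5, 25.0, 25.0))
```

All four counts must be positive for this sketch to work; zero counts would need separate handling.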
DFAM: family-based association for disease traits
The DFAM procedure in PLINK implements the sib-TDT and also allows for
unrelated individuals to be included (via a clustered analysis using
the Cochran-Mantel-Haenszel test). To perform this test:
plink --bfile mydata --dfam
which generates the file
plink.dfam
which contains the fields
CHR         Chromosome code
SNP         SNP identifier
A1:A2       Minor and major allele codes
OBS         Number of observed minor alleles
EXP         Number of expected minor alleles
CHISQ       Chi-squared test statistic
P           Asymptotic p-value
This test can therefore be used to combine discordant sibship data,
parent-offspring trio data and unrelated case/control data in a single
analysis.
NOTE If you are analysing a sibling-only sample (i.e.
no parents) then also add the --nonfounders option; otherwise,
all SNPs will be pruned out at the filtering stage, as PLINK will by
default only consider founder alleles when calculating allele frequency,
Hardy-Weinberg, etc.
QFAM: family-based association tests for quantitative traits
PLINK offers a somewhat ad-hoc procedure to perform family-based tests of
association with quantitative phenotypes: the QFAM procedure, which uses
permutation to account for the dependence between related individuals.
It adopts the between/within model as used by Fulker et al (1999, AJHG)
and Abecasis et al (2000, AJHG) as implemented in the QTDT package.
However, rather than fitting a maximum likelihood variance components
model, as QTDT does, PLINK performs a simple linear regression of
phenotype on genotype, but then uses a special permutation procedure to
correct for family structure.
There are several ways to run QFAM: a total association test (between and
within components)
plink --bfile mydata --qfam-total --mperm 100000
or a within-family test
plink --bfile mydata --qfam --mperm 100000
or a test including parental phenotypes
plink --bfile mydata --qfam-parents --mperm 100000
(Also, --qfam-between will look only at the between-family
component of association).
NOTE In all cases above, we have used
--mperm to specify permutation; adaptive permutation can
also be used with QFAM (--perm). Permutation is necessary for
the QFAM test.
The columns in the QFAM permutation result files are:
CHR         Chromosome code
SNP         SNP identifier
STAT        Test statistic (ignore)
EMP1        Pointwise empirical p-value
NP          Number of permutations performed
The columns in the non-permutation file (e.g. plink.qfam.total, if plink.qfam.total.mperm
contains the permuted results) are as follows:
CHR         Chromosome code
SNP         SNP identifier
A1          Minor allele (corresponds to beta given below; absent in earlier PLINK releases)
TEST        Type of test: TOT, WITH or BET
NMISS       Number of non-missing individuals in analysis
BETA        Regression coefficient
STAT        Test statistic (ignore; not corrected for family structure)
P           Asymptotic p-value (ignore; use empirical p-value)
These results are from a standard --linear type analysis, i.e. which ignores family structure.
They are displayed so that the direction of effect may be determined (from the BETA) -- but
otherwise, only the empirical p-value from the permuted results file should be looked at.
The B and W components are calculated using parental genotypes if
they are available for both parents; otherwise siblings are used.
Singletons can be included in this analysis (with B=G and W=0 for them).
For example, the scores for a few configurations, when parents are
available, are shown below:
Genotype   G
AA         1
Aa         0
aa        -1
B = ( Pat + Mat ) / 2
W = G - B
Pat   Mat   Offspring    G     B      W
AA    AA    AA           1     1      0
Aa    AA    AA           1     0.5    0.5
Aa    AA    Aa           0     0.5   -0.5
aa    AA    Aa           0     0      0
etc
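The scoring rules in the table above can be reproduced with a few lines of Python. This is only a sketch for the case where both parents are genotyped; the function name and the genotype strings are illustrative choices, not PLINK identifiers.

```python
# Genotype scores as given in the table: AA = 1, Aa = 0, aa = -1.
SCORE = {"AA": 1, "Aa": 0, "aa": -1}


def between_within(pat, mat, offspring):
    """Return (G, B, W) for one offspring with both parents genotyped."""
    g = SCORE[offspring]
    b = (SCORE[pat] + SCORE[mat]) / 2   # between-family component
    w = g - b                           # within-family component
    return g, b, w


# The four configurations listed in the table.
for trio in [("AA", "AA", "AA"), ("Aa", "AA", "AA"),
             ("Aa", "AA", "Aa"), ("aa", "AA", "Aa")]:
    print(trio, between_within(*trio))
```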
The QFAM permutation procedure breaks down the genotypes into between
(B) and within (W) components, permutes them independently (i.e. at the
family level, either swapping the B component for one family with another
family, or flipping the sign of all W's in a family with 50:50 chance)
and then (for the total association test) reconstructs the individual
level "genotypes" as the sum of the
new B's and W's i.e:
1) G -> B + W (individual-level)
2a) Permute B (family-level) -> B'
2b) Permute W (family-level) -> W'
3) B' + W' -> G' (individual-level)
The logic is that we know how to permute both B and W separately whilst
maintaining the familial structural component, and they are orthogonal
components, so we should permute them separately, but then recombine them
as a single individual-level genotypic score.
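The permutation steps above might be sketched in Python as follows. This is a simplified illustration under stated assumptions, not PLINK's implementation: each family is represented by a hypothetical dict holding one shared B value and a list of per-individual W values, there is no missing data, and a single SNP is considered.

```python
import random


def permute_families(families, rng):
    """One QFAM-style permutation replicate for a single SNP.

    families: list of {"B": float, "W": [float, ...]} (hypothetical layout).
    Returns, per family, a tuple (B', W', G') with G' = B' + W'.
    """
    # Step 2a: swap B components between families.
    new_bs = [fam["B"] for fam in families]
    rng.shuffle(new_bs)

    result = []
    for fam, new_b in zip(families, new_bs):
        # Step 2b: flip the sign of every W in the family with 50:50 chance.
        sign = rng.choice((1, -1))
        new_w = [sign * w for w in fam["W"]]
        # Step 3: recombine into individual-level scores.
        new_g = [new_b + w for w in new_w]
        result.append((new_b, new_w, new_g))
    return result


families = [
    {"B": 1.0, "W": [0.0, 0.0]},
    {"B": 0.5, "W": [0.5, -0.5]},
    {"B": 0.0, "W": [0.0]},
]
for row in permute_families(families, random.Random(1)):
    print(row)
```

Passing an explicit random.Random instance keeps a replicate reproducible, which is convenient when the same family-level permutation is to be reused across SNPs.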
NOTE The total --qfam-total test is designed to
extract all association information from a family-based sample,
controlling for relatedness: it is not robust to stratification. Use the
--qfam for a strictly within family test.
In many circumstances, the standard QTDT as implemented in Goncalo
Abecasis' QTDT
program will perhaps be more appropriate. The disadvantages of the QFAM
procedure are
- that it uses permutation and so is slower
- that it appears to be slightly less powerful when there is a higher
residual correlation
On the plus side, the advantages of the QFAM procedure are
- that it uses permutation and so is appropriate for non-normal
phenotypes; it could also be used for disease phenotypes, although it will
not be appropriate for affected-only TDT style designs
- that it can be applied to genome-wide data easily (albeit not
necessarily quickly)
Technical note When permuting
genotypes between families in this way, one has to be careful with missing
genotype data, particularly in the instance in which a family is
completely missing. Because a missing B component cannot be recombined
with a non-missing W component, and vice versa, this process would tend
to increase the amount of missingness in permutations versus the original
data.
One could exclude individuals with missing genotypes first and
permute separately for each SNP, but this would no longer maintain
the correlation between SNPs (and would require more computation). Instead, we
use the following scheme. We permute once per replicate, giving a table
that pairs each original family (F) with a permuted family (F').
Suppose that, for a given SNP, family 2 is missing its B component
(denoted 2*).
For example:
F F'
0 5
1 2* <- remove ?
2* 4 <- remove ?
3 1
4 0
5 3
This would knock out families 1 and 2 from the permutation. We therefore
permute once to create a single table for permutation for all SNPs, but
then recursively edit the table on a SNP-by-SNP basis, to regroup the
missing families, by swapping missing F' families: in this case, swap 2*
with 4 (the other partner of 2*), e.g.
F F'
0 5
1 4
2* 2* <- remove
3 1
4 0
5 3
So now we have a permuted sample but the total level of missingness is the
same. This procedure still generates valid, completely random permutations
of the non-missing genotype data and tries to maintain as much of the
correlation between SNPs as possible (i.e. as typically only a small % of
genotypes are missing and so we do not need to edit the table much).
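The table edit described above can be sketched in Python. This is one possible reading of the scheme, written as a single pass rather than a recursion, and is not taken from the PLINK source: the permutation is held as a hypothetical list perm where perm[f] is the F' family assigned to original family f, and each missing family is swapped back onto itself.

```python
def regroup_missing(perm, missing):
    """Edit a family permutation so that missing families pair with themselves.

    perm:    list with perm[f] = permuted family F' for original family f.
    missing: set of family indices missing at the current SNP.
    Returns a new list; the shared table in perm is left unchanged.
    """
    edited = list(perm)
    for f in sorted(missing):
        if edited[f] == f:
            continue                 # already paired with itself
        g = edited.index(f)          # family currently receiving f
        edited[g] = edited[f]        # give g the partner that f had
        edited[f] = f                # f now maps to itself and drops out
    return edited


# The example from the text: family 2 is missing.
perm = [5, 2, 4, 1, 0, 3]
print(regroup_missing(perm, {2}))
```

Because the shared table is copied rather than modified, the same replicate-level permutation can be reused for the next SNP, with only the SNP-specific missing families edited.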