Family-based association analysis
PLINK focuses mainly on population-based samples. It does, however,
offer some support for family-based analyses of disease traits and
quantitative traits, as described in this section.
Family-based association (TDT)
PLINK supports basic family-based association testing for disease
traits, using the TDT and a variant of this test that also incorporates
parental phenotype information, the parenTDT.
To run a basic TDT analysis for family data:
plink --file mydata --tdt
which generates the file
plink.tdt
If permutation has been requested, then either
plink.tdt.perm
or
plink.tdt.mperm
will be generated also. The main output file, plink.tdt,
contains the following fields:
CHR         Chromosome number
SNP         SNP identifier
A1          Minor allele code
A2          Major allele code
T           Transmitted minor allele count
U           Untransmitted allele count
OR          TDT odds ratio
CHISQ       TDT chi-square statistic
P           TDT asymptotic p-value
A:U_PAR     Parental discordance counts
CHISQ_PAR   Parental discordance statistic
P_PAR       Parental discordance asymptotic p-value
CHISQ_COM   Combined test statistic
P_COM       Combined test asymptotic p-value
If the --ci option has been requested, then two additional
fields will appear after the OR field:
L95         Lower 95% confidence interval for TDT odds ratio
U95         Upper 95% confidence interval for TDT odds ratio
(Naturally, if a value other than 0.95 was used as the argument for the
--ci option, that value will appear here instead.)
The TDT statistic is calculated simply as
(b - c)^2 / (b + c)
where b and c are the number of transmitted and
untransmitted alleles as shown in plink.tdt; under the
null, it is distributed as a 1 df chi-squared.
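The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula, not PLINK's own code: the function name tdt and the example counts are hypothetical, the odds ratio is assumed to be the simple ratio b/c, and the p-value uses the identity that the upper tail of a 1 df chi-squared at x equals erfc(sqrt(x/2)).

```python
import math

def tdt(b, c):
    """Return (odds ratio, chi-squared, asymptotic p-value) for the basic TDT.

    b: transmitted minor allele count (the T field)
    c: untransmitted allele count (the U field)
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chisq = (b - c) ** 2 / (b + c)
    # Upper tail of a 1 df chi-squared: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chisq / 2))
    # Assumption: odds ratio taken as T/U; undefined when U is zero
    odds_ratio = b / c if c > 0 else float("nan")
    return odds_ratio, chisq, p

# Hypothetical counts: 60 transmissions versus 40 non-transmissions
odds_ratio, chisq, p = tdt(60, 40)
print(odds_ratio, chisq, p)
```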
The parental discordance test is based on counting the number of
alleles in affected versus unaffected parents, treating each
nuclear family parental pair as a matched pair. These counts
can be combined with the T and U counts of the
basic TDT to give a combined test statistic, also shown in the output.
The parenTDT assumes homogeneity within families rather than between
families, in terms of population stratification. If parents are measured
on the phenotype, then this test can add considerable power to
family-based association analysis, whilst providing strong (but not
complete) protection against population stratification. The increase
in power will depend on the proportion of parents that are discordant
for the disease. This approach is described in Purcell et al,
AJHG (2005). PLINK uses a simpler approach to calculate the
PAR and COM statistics, however: if
                 Unaffected parent
                 A/A    A/B    B/B
Affected   A/A    -      p      r
parent     A/B    x      -      q
           B/B    z      y      -
i.e. such that the A:U_PAR field represents p+q+2r : x+y+2z,
then
PAR = ( (p+q+2r) - (x+y+2z) )^2
      / ( p+q+x+y+4(r+z) )
and
COM = ( ( b+p+q+2r ) - ( c+x+y+2z ) )^2
      / ( b+p+q+c+x+y+4(r+z) )
Both statistics follow a 1 df chi-squared distribution under the null.
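As a worked illustration of the PAR and COM formulas, the following Python sketch evaluates them for a set of made-up cell counts. The function names and the example numbers are hypothetical; p, q, r, x, y and z are the discordant parental-pair counts from the table above, and b and c are the T and U counts of the basic TDT.

```python
def par_statistic(p, q, r, x, y, z):
    """Parental discordance counts (A, U) and the PAR chi-squared."""
    a_count = p + q + 2 * r          # left side of the A:U_PAR field
    u_count = x + y + 2 * z          # right side of the A:U_PAR field
    denom = p + q + x + y + 4 * (r + z)
    if denom == 0:
        raise ValueError("no discordant parental pairs")
    return a_count, u_count, (a_count - u_count) ** 2 / denom

def com_statistic(b, c, p, q, r, x, y, z):
    """Combined (TDT + parenTDT) chi-squared."""
    denom = b + p + q + c + x + y + 4 * (r + z)
    if denom == 0:
        raise ValueError("no informative observations")
    return ((b + p + q + 2 * r) - (c + x + y + 2 * z)) ** 2 / denom

# Hypothetical counts for illustration only
a_count, u_count, par = par_statistic(p=10, q=8, r=3, x=5, y=4, z=1)
com = com_statistic(b=60, c=40, p=10, q=8, r=3, x=5, y=4, z=1)
print(a_count, u_count, round(par, 3), round(com, 3))
```

Both values would then be referred to a 1 df chi-squared distribution, as for the basic TDT.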
When running the --tdt option, PLINK will first
perform a check for Mendel errors and set the offending
genotypes to missing.
Using the --tdt option, if permutation is requested (using
either --perm or --mperm) a file entitled either
plink.tdt.perm
or
plink.tdt.mperm
will be generated: the empirical p-value will be based on the standard TDT
test. The permutation procedure will flip transmitted/untransmitted status
consistently for all SNPs for a given family, thereby preserving the LD and
linkage information between markers and siblings.
parenTDT
The parenTDT, described in the paragraph above, is
automatically included when using the --tdt option. The
alternative commands below generate the same output as the --tdt
command, described above, except that the permutation is based not on the
standard TDT, but on either the parenTDT, if using the option
plink --file mydata --parentdt1
or the combined test (TDT and parenTDT), if using the option
plink --file mydata --parentdt2
Parent of origin analysis
When performing family-based TDT analysis, it is possible to
separately consider transmissions from heterozygous fathers
versus heterozygous mothers to affected offspring. This is performed
by adding the --poo option to request parent-of-origin analysis:
plink --file mydata --tdt --poo
which generates the file plink.tdt.poo. If permutation is also requested,
this also generates the file plink.tdt.poo.perm or plink.tdt.poo.mperm,
depending on which permutation procedure is used. The main output file has the following
format:
CHR         Chromosome number
SNP         SNP identifier
A1:A2       Allele 1 : allele 2 codes
T:U_PAT     Paternal transmitted : untransmitted counts
OR_PAT      Paternal odds ratio
CHISQ_PAT   Paternal chi-squared test
T:U_MAT     Maternal, as above
OR_MAT      Maternal, as above
CHISQ_MAT   Maternal, as above
Z_POO       Z score for difference in paternal versus maternal odds ratios
P_POO       Asymptotic p-value for parent-of-origin test
If permutation is requested, the default test statistic is the absolute value of the Z
score for the parent-of-origin test (i.e. making a two-sided test). The flags
--pat and --mat indicate that the permutation statistic should instead
be the paternal TDT chi-squared statistic or the maternal statistic, respectively.
NOTE When both parents are heterozygous, these ambiguous transmissions
are counted as 0.5 for both mother and father; this is why the T:U counts will often
not be whole numbers.
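This section does not give the formula behind Z_POO. One standard way to compare two odds ratios is a Z test on the difference of their logarithms, and the Python sketch below uses that approach purely as an assumption; it is not confirmed to be PLINK's exact computation. The function name poo_z and the example counts are hypothetical. Note that fractional counts, as produced by the 0.5 weighting described above, are handled naturally.

```python
import math

def poo_z(t_pat, u_pat, t_mat, u_mat):
    """Z score and two-sided p-value for paternal versus maternal odds ratios.

    Assumption: Z = (ln OR_pat - ln OR_mat) / SE, with the usual
    large-sample SE of a difference between two log odds ratios.
    All four counts must be greater than zero.
    """
    log_or_pat = math.log(t_pat / u_pat)
    log_or_mat = math.log(t_mat / u_mat)
    se = math.sqrt(1 / t_pat + 1 / u_pat + 1 / t_mat + 1 / u_mat)
    z = (log_or_pat - log_or_mat) / se
    # Two-sided normal tail probability
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical T:U_PAT = 30.5:20.5 and T:U_MAT = 22.5:25.5
z, p = poo_z(30.5, 20.5, 22.5, 25.5)
print(z, p)
```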
DFAM: family-based association for disease traits
The DFAM procedure in PLINK implements the sib-TDT and also allows for
unrelated individuals to be included (via a clustered analysis using
the Cochran-Mantel-Haenszel test). To perform this test:
plink --bfile mydata --dfam
which generates the file
plink.dfam
which contains the fields
CHR         Chromosome code
SNP         SNP identifier
A1:A2       Minor and major allele codes
OBS         Number of observed minor alleles
EXP         Number of expected minor alleles
CHISQ       Chi-squared test statistic
P           Asymptotic p-value
This test can therefore be used to combine discordant sibship data,
parent-offspring trio data and unrelated case/control data in a single
analysis.
NOTE If you are analysing a sibling-only sample (i.e.
no parents) then also add the --nonfounders option; otherwise,
all SNPs will be pruned out at the filtering stage, as PLINK will by
default only consider founder alleles when calculating allele frequency,
Hardy-Weinberg, etc.
QFAM: family-based association tests for quantitative traits
PLINK offers a somewhat ad hoc procedure to perform family-based tests of
association with quantitative phenotypes: the QFAM procedure, which uses
permutation to account for the dependence between related individuals.
It adopts the between/within model as used by Fulker et al (1999, AJHG)
and Abecasis et al (2000, AJHG) as implemented in the QTDT package.
However, rather than fitting a maximum likelihood variance components
model, as QTDT does, PLINK performs a simple linear regression of
phenotype on genotype, but then uses a special permutation procedure to
correct for family structure.
There are several ways to run QFAM: a total association test (between and
within components)
plink --bfile mydata --qfam-total --mperm 100000
or a within-family test
plink --bfile mydata --qfam --mperm 100000
or a test including parental phenotypes
plink --bfile mydata --qfam-parents --mperm 100000
(Also, --qfam-between will look only at the between-family
component of association.)
NOTE In all cases above, we have used
--mperm to specify permutation; adaptive permutation can
also be used with QFAM (--perm). Permutation is necessary for
the QFAM test.
The columns in the QFAM permutation result files are:
CHR         Chromosome code
SNP         SNP identifier
STAT        Test statistic (ignore)
EMP1        Pointwise empirical p-value
NP          Number of permutations performed
The columns in the non-permutation file (e.g. plink.qfam.total, if plink.qfam.total.mperm
contains the permuted results) are as follows:
CHR         Chromosome code
SNP         SNP identifier
A1          Minor allele (corresponds to beta given below; absent in earlier PLINK releases)
TEST        Type of test: TOT, WITH or BET
NMISS       Number of non-missing individuals in analysis
BETA        Regression coefficient
STAT        Test statistic (ignore; not corrected for family structure)
P           Asymptotic p-value (ignore; use empirical p-value)
These results are from a standard linear regression analysis, i.e. one that ignores family structure.
They are displayed so that the direction of effect may be determined (from the BETA); otherwise,
only the empirical p-value from the permuted results file should be looked at.
The B and W components are calculated using parental genotypes if
they are available for both parents; otherwise siblings are used.
Singletons can be included in this analysis (i.e. B=G and W=0 for them).
For example, the scores are shown below for a few configurations, when
parents are available:
Genotype   G
AA         1
Aa         0
aa        -1
B = ( Pat + Mat ) / 2
W = G - B
Pat   Mat   Offspring   G    B     W
AA    AA    AA          1    1     0
Aa    AA    AA          1    0.5   0.5
Aa    AA    Aa          0    0.5  -0.5
aa    AA    Aa          0    0     0
etc
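The scoring for the case where both parents are genotyped can be written out as a short Python sketch. The names SCORE and between_within are hypothetical and the code only mirrors the table above; it does not cover the sibling-based calculation used when parents are unavailable.

```python
# Genotype scores G as defined above
SCORE = {"AA": 1, "Aa": 0, "aa": -1}

def between_within(pat, mat, offspring):
    """Return (G, B, W) for one offspring, given both parental genotypes."""
    g = SCORE[offspring]
    b = (SCORE[pat] + SCORE[mat]) / 2   # between-family component
    w = g - b                           # within-family component
    return g, b, w

# The four configurations listed in the table
for trio in [("AA", "AA", "AA"), ("Aa", "AA", "AA"),
             ("Aa", "AA", "Aa"), ("aa", "AA", "Aa")]:
    print(trio, between_within(*trio))
```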
The QFAM permutation procedure breaks down the genotypes into between
(B) and within (W) components, permutes them independently (i.e. at the
family level, either swapping the B component for one family with another
family, or flipping the sign of all W's in a family with 50:50 chance)
and then (for the total association test) reconstructs the individual
level "genotypes" as the sum of the
new B's and W's, i.e.:
1)  G -> B + W (individual-level)
2a) Permute B (family-level) -> B'
2b) Permute W (family-level) -> W'
3)  B' + W' -> G' (individual-level)
The logic is that we know how to permute both B and W separately whilst
maintaining the familial structural component, and they are orthogonal
components, so we should permute them separately, but then recombine them
as a single individual-level genotypic score.
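A minimal Python sketch of one permutation replicate, following steps 2a, 2b and 3, is shown below. It assumes a single B value per family and a list of W values per family member, with no missing data; the function name permute_qfam and the dictionary layout are hypothetical and are not taken from PLINK.

```python
import random

def permute_qfam(b_by_family, w_by_family, rng):
    """Return permuted individual-level scores G' for each family.

    b_by_family: {family id: B component}
    w_by_family: {family id: [W component of each member]}
    """
    families = list(b_by_family)
    donors = families[:]
    rng.shuffle(donors)                  # 2a) swap B between families
    permuted = {}
    for fam, donor in zip(families, donors):
        sign = rng.choice((1, -1))       # 2b) flip all W's in a family together
        b_new = b_by_family[donor]
        # 3) recombine into individual-level "genotypes"
        permuted[fam] = [b_new + sign * w for w in w_by_family[fam]]
    return permuted

rng = random.Random(1)
b = {"f1": 0.5, "f2": 0.0, "f3": 1.0}
w = {"f1": [0.5, -0.5], "f2": [0.0, 0.0], "f3": [0.0]}
print(permute_qfam(b, w, rng))
```

Because the sign flip is applied to the whole family at once, differences between siblings keep their magnitude in every replicate, which is what preserves the family structure.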
NOTE The total --qfam-total test is designed to
extract all association information from a family-based sample,
controlling for relatedness: it is not robust to stratification. Use
--qfam for a strictly within-family test.
In many circumstances, the standard QTDT as implemented in Goncalo
Abecasis' QTDT
program will perhaps be more appropriate. The disadvantages of the QFAM
procedure are
 - that it uses permutation and so is slower
 - that it appears to be slightly less powerful when there is a higher
   residual correlation
On the plus side, the advantages of the QFAM procedure are
 - that it uses permutation and so is appropriate for non-normal
   phenotypes; it could also be used for disease phenotypes, although it will
   not be appropriate for affected-only TDT-style designs
 - that it can be applied to genome-wide data easily (albeit not
   necessarily quickly)
Technical note When permuting
genotypes between families in this way, one has to be careful with missing
genotype data, particularly in the instance in which a family is
completely missing. Because a missing B component cannot be recombined
with a non-missing W component, and vice versa, this process would tend
to increase the amount of missingness in permutations versus the original
data.
One could exclude individuals with missing genotypes first and
permute separately for each SNP, but this would no longer maintain
the correlation between SNPs (and would require more computation). Instead, we
use the following scheme. We permute once per replicate, creating a table of
F (original family) and F' (permuted family). Suppose that family 2 is
missing its B component (denoted 2*).
For example:
F     F'
0     5
1     2*   <- remove ?
2*    4    <- remove ?
3     1
4     0
5     3
This would knock out families 1 and 2 from the permutation. We therefore
permute once to create a single table for permutation for all SNPs, but
then recursively edit the table on a SNP-by-SNP basis, to regroup the
missing families, by swapping missing F' families: in this case, swap 2*
with 4 (the other partner of 2*), e.g.
F     F'
0     5
1     4
2*    2*   <- remove
3     1
4     0
5     3
So now we have a permuted sample but the total level of missingness is the
same. This procedure still generates valid, completely random permutations
of the non-missing genotype data and tries to maintain as much of the
correlation between SNPs as possible (i.e. typically only a small % of
genotypes are missing, so we do not need to edit the table much).
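One way to read the table-editing step is sketched below in Python. This is an interpretation of the description above rather than PLINK's actual implementation: the function name regroup_missing and the dictionary representation of the F to F' table are hypothetical. Each missing family is spliced out of the permutation by handing its F' partner to whichever family was due to receive it, so that the missing family ends up paired with itself.

```python
def regroup_missing(perm, missing):
    """Return a copy of the F -> F' table in which every family listed in
    `missing` maps to itself, leaving the other families permuted."""
    edited = dict(perm)
    for m in missing:
        if edited[m] == m:
            continue                     # already paired with itself
        # The family that was due to receive the missing family's component
        receiver = next(f for f, donor in edited.items() if donor == m)
        edited[receiver] = edited[m]     # hand over m's partner instead
        edited[m] = m
    return edited

# The example from the text: family 2 is missing at this SNP
perm = {0: 5, 1: 2, 2: 4, 3: 1, 4: 0, 5: 3}
print(regroup_missing(perm, missing={2}))
```

The original table is left unmodified, so the same per-replicate permutation can be re-edited independently for each SNP.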

