1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Meta-analysis
21. Annotation
22. LD-based results clumping
23. Gene-based report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. R-plugins
28. Annotation web-lookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flowchart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK


PLINK is a whole genome
association analysis toolset, designed to perform large-scale analyses
in a computationally efficient manner. By using a binary
representation of SNP data, very large datasets can be loaded into
memory in their entirety and processed very efficiently. For example,
calculating the chi-square statistics and odds ratios for all SNPs in
a 100K dataset containing 350 individuals took ~2 seconds on a
standard Linux workstation, making a permutation-based approach to
whole genome analysis feasible.
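
To illustrate why a binary representation is compact, here is a minimal sketch that packs four genotypes into each byte, two bits per genotype. The genotype coding (0, 1, 2 copies of an allele, 3 for missing) and the function names are assumptions made for illustration; this is not a description of PLINK's actual file format.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (each 0..3) into bytes, four per byte, low bits first."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError("genotype code must be in 0..3")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


# Hypothetical coding: 0/1/2 = copies of one allele, 3 = missing genotype.
codes = [0, 1, 2, 3, 2, 0]
packed = pack_genotypes(codes)
print(len(codes), "genotypes stored in", len(packed), "bytes")
```

Under this assumed two-bit coding, 100,000 SNPs for 350 individuals (35 million genotypes) would occupy 8.75 million bytes, which fits comfortably in memory.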

PLINK will generate a
number of standard summary statistics that are useful for quality
control and can be used as thresholds for subsequent analyses
(e.g. missing genotype rate, minor allele frequency, Hardy-Weinberg
equilibrium failures and non-Mendelian transmission rates) as well as
some more novel ones (estimated inbreeding coefficients for each
individual and genome-wide identity-by-state and identity-by-descent
estimates for all pairs of individuals). The latter can be used to
detect sample contaminations, swaps and duplications as well as
pedigree errors and unknown familial relationships (e.g. sibling pairs
in a case/control population-based sample). PLINK also provides a simple
interface for recoding, reordering, flipping the DNA strand and extracting
subsets of SNP data.
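
As a rough illustration of three of the summary statistics named above, the sketch below computes the missing genotype rate, the minor allele frequency and a simple 1-df chi-square goodness-of-fit statistic for Hardy-Weinberg equilibrium for a single SNP. The input coding and the function name are assumptions, and the chi-square form is chosen for brevity; it is not necessarily the test PLINK applies.

```python
def snp_summary(genotypes):
    """Return (missing_rate, maf, hwe_chisq) for one SNP.

    genotypes: list of 0, 1 or 2 (copies of allele A), or None if missing.
    This input coding is an assumption made for the example.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    missing_rate = 1 - n / len(genotypes)
    if n == 0:
        return missing_rate, float("nan"), float("nan")
    p = sum(called) / (2 * n)            # frequency of allele A
    q = 1 - p
    maf = min(p, q)
    if maf == 0:                         # monomorphic SNP: HWE test undefined
        return missing_rate, maf, float("nan")
    observed = [called.count(0), called.count(1), called.count(2)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    hwe_chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return missing_rate, maf, hwe_chisq


print(snp_summary([0, 0, 1, 1, 1, 1, 2, 2, None, None]))
```

Statistics like these can then be compared against user-chosen thresholds to decide which SNPs or individuals to carry forward.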

A simple but powerful approach to population stratification, which
can use whole genome SNP data in a computationally efficient
manner, is included in PLINK. We use complete linkage agglomerative
clustering, based on pairwise IBS distance, but with some
modifications to the clustering
process: restrictions based on a significance test for whether two
individuals belong to the same population (i.e. do not merge clusters
that contain significantly different individuals), a phenotype
criterion (i.e. all pairs must contain at least one case and one
control) and a cluster size restriction (i.e. such that, with a
cluster size of 2, for example, the subsequent association test would
implicitly match every case with its nearest control, as long as the
case and control do not show evidence of belonging to different
populations). Any evidence of population substructure (from this or
any other analysis) can be incorporated in subsequent association
tests via the specification of clusters because, at each permutation
step of the tests described below, individuals are permuted only
within their cluster.
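
The clustering idea can be sketched as follows: a minimal, unoptimised complete linkage procedure on pairwise IBS distance with an optional cap on cluster size. The genotype coding, the distance definition (one minus the proportion of alleles shared IBS) and the function names are assumptions for illustration, and the significance-test and phenotype restrictions described above are deliberately left out.

```python
def ibs_distance(g1, g2):
    """1 - proportion of alleles shared IBS over SNPs called in both individuals."""
    shared, total = 0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)   # 2, 1 or 0 alleles shared at this SNP
        total += 2
    return 1 - shared / total if total else 1.0


def complete_linkage(genotypes, max_size=None):
    """Greedy complete linkage clustering; returns lists of individual indices."""
    n = len(genotypes)
    dist = [[ibs_distance(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if max_size and len(clusters[x]) + len(clusters[y]) > max_size:
                    continue   # merging would violate the size restriction
                # complete linkage: cluster distance is the largest pairwise distance
                d = max(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best is None:       # no permitted merge remains
            return clusters
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]


# Four hypothetical individuals typed at four SNPs (0/1/2 = allele count).
people = [[0, 0, 0, 0], [0, 0, 0, 1], [2, 2, 2, 2], [2, 2, 2, 1]]
print(complete_linkage(people, max_size=2))
```

With `max_size=2` the procedure pairs each individual with its nearest permitted neighbour, which mirrors the matching interpretation given in the text. Without a cap, this sketch keeps merging until a single cluster remains, since it has no stopping rule of its own.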

The basic association test is for a disease trait and is based on
comparing allele frequencies between cases and controls (asymptotic
and empirical p-values are available). Also implemented are the
Cochran-Armitage trend test, dominant and recessive models and a two
degree of freedom general model. The ability to compare different
models by likelihood ratio test and to evaluate the significance of the
most significant model by permutation is also incorporated. We also
test for differences in missing genotype rates between cases and
controls.
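
For a single SNP, comparing allele frequencies between cases and controls amounts to a test on a 2x2 table of allele counts. The sketch below uses the standard 1-df Pearson chi-square and the cross-product odds ratio; the function and argument names are illustrative assumptions, not PLINK's interface.

```python
def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Return (chi-square, odds ratio) for a 2x2 table of allele counts.

    case_a / case_b: counts of alleles A and B in cases; ctrl_* likewise
    in controls. Raises ZeroDivisionError if a margin, or the product
    case_b * ctrl_a, is zero.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = ((case_a + case_b) * (ctrl_a + ctrl_b)
                   * (case_a + ctrl_a) * (case_b + ctrl_b))
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return numerator / denominator, odds_ratio


# Hypothetical counts: 60 A and 40 B alleles in cases, 40 A and 60 B in controls.
print(allelic_test(60, 40, 40, 60))  # (8.0, 2.25)
```

The asymptotic p-value follows from comparing the statistic to a chi-square distribution with one degree of freedom; an empirical p-value replaces that step with permutation of case/control labels.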

Family-based data can be analyzed with the basic TDT, using either
asymptotic or empirical significance values. In addition, the basic
test is modified to include information on parental phenotypes to give
a more powerful combined test. The permutation procedure will flip
transmitted/untransmitted status consistently for all SNPs for a given
family, thereby preserving the LD and linkage information between
markers and siblings.
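
A minimal sketch of these two ideas follows: the standard TDT statistic computed from counts of transmitted and untransmitted alleles, and a permutation step in which one random decision per family flips transmitted/untransmitted status at every SNP. The data layout (per-family lists of per-SNP count pairs) and the function names are simplifying assumptions, not PLINK's internal representation.

```python
import random


def tdt_statistic(transmitted, untransmitted):
    """Standard TDT chi-square (1 df) from heterozygous-parent transmissions."""
    b, c = transmitted, untransmitted
    return (b - c) ** 2 / (b + c)


def permute_families(family_counts, rng):
    """Flip T/U status family by family.

    family_counts: {family_id: [(t, u), ...]} with one (transmitted,
    untransmitted) pair per SNP. A single coin flip per family is applied
    to all of its SNPs, so between-marker structure within the family is kept.
    """
    permuted = {}
    for fam, per_snp in family_counts.items():
        flip = rng.random() < 0.5
        permuted[fam] = [(u, t) if flip else (t, u) for t, u in per_snp]
    return permuted


print(tdt_statistic(30, 10))  # 10.0

# Hypothetical data: two families, three SNPs each.
data = {"fam1": [(1, 0), (0, 1), (1, 0)], "fam2": [(0, 1), (0, 1), (1, 0)]}
print(permute_families(data, random.Random(0)))
```

Recomputing the TDT statistic on many such permuted datasets gives the null distribution from which an empirical significance value can be read.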

Quantitative traits can also be tested for association, using either
asymptotic (likelihood ratio test and Wald test) or empirical
significance values. As above, any clustering scheme can be specified
for the permutations: in this way, it is possible to control for
subpopulation membership as well as family structure, if related
individuals are being analyzed.
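
Permutation within clusters can be sketched as below: phenotype values are shuffled only among individuals sharing a cluster label, and an empirical significance value is the proportion of permuted statistics at least as large as the observed one. The function names are assumptions, and the (r + 1) / (n + 1) form is one common convention, used here without any claim that it matches PLINK's exact calculation.

```python
import random


def permute_within_clusters(phenotypes, clusters, rng):
    """Shuffle phenotype values among individuals that share a cluster label."""
    permuted = list(phenotypes)
    members = {}
    for idx, label in enumerate(clusters):
        members.setdefault(label, []).append(idx)
    for idxs in members.values():
        values = [phenotypes[i] for i in idxs]
        rng.shuffle(values)
        for i, v in zip(idxs, values):
            permuted[i] = v
    return permuted


def empirical_p(observed, null_stats):
    """(r + 1) / (n + 1), where r counts null statistics >= the observed one."""
    r = sum(s >= observed for s in null_stats)
    return (r + 1) / (len(null_stats) + 1)


# Hypothetical quantitative trait values and cluster labels.
trait = [1.2, 0.7, 3.1, 2.9, 0.4, 0.9]
labels = ["A", "A", "B", "B", "A", "B"]
print(permute_within_clusters(trait, labels, random.Random(0)))
print(empirical_p(5.0, [1.0, 2.0, 6.0, 7.0]))  # 0.6
```

Because values never cross cluster boundaries, any trait difference between clusters is preserved in every permuted dataset and so cannot inflate the test.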

For all tests (disease, quantitative and family-based) it is possible
to specify 'sets' of SNPs to calculate 'set-based' or 'gene-based'
test statistics: these are based on cumulative sums of rank-ordered
single SNP statistics, and are evaluated by permutation.
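
The phrase "cumulative sums of rank-ordered single SNP statistics" can be illustrated with a short sketch: sort the per-SNP statistics of a set from largest to smallest and take running totals. How these running totals are then combined into a final set statistic is not specified in this section, so the sketch stops there; the function name is an assumption.

```python
from itertools import accumulate


def cumulative_set_statistics(snp_stats):
    """Running sums of single-SNP statistics ordered from largest to smallest."""
    ordered = sorted(snp_stats, reverse=True)
    return list(accumulate(ordered))


# Hypothetical chi-square statistics for three SNPs in one gene.
print(cumulative_set_statistics([1.0, 4.0, 2.5]))  # [4.0, 6.5, 7.5]
```

Applying the same calculation to each permuted dataset yields the null distribution against which the observed set statistics are evaluated.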

For disease-trait population-based samples, it is possible to test
for epistasis. The epistasis test can either be case-only or
case-control. Either all pairwise combinations of SNPs can be tested
(although this is most likely not desirable, it is computationally
feasible using PLINK: the 4.5 billion two-locus tests generated from
a 100K data set took just over 24 hours to run) or sets can be
specified (e.g. to test only the most significant 100 SNPs against all
other SNPs, or against themselves, etc). The output consists only of
pairwise epistatic results above a certain significance value; also,
for each SNP, a summary of all the pairwise epistatic tests is given
(e.g. maximum test, proportion of tests significant at a certain
threshold, etc). A similar methodology allows for testing of
gene-environment interaction (for dichotomous environmental
variables).
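
The scale of an all-pairs scan follows from the number of unordered SNP pairs, n(n-1)/2. The short calculation below is only arithmetic on the figures quoted above. Note that exactly 100,000 SNPs would give about 5.0 billion pairs, so the quoted 4.5 billion presumably reflects a somewhat smaller number of SNPs actually analyzed (roughly 95,000); that is an inference, not something stated in the text.

```python
from math import comb


def n_pairwise_tests(n_snps):
    """Number of unordered SNP pairs, n * (n - 1) / 2."""
    return comb(n_snps, 2)


print(n_pairwise_tests(100_000))   # 4999950000

# Implied throughput for the quoted run: 4.5 billion tests in about 24 hours.
tests_per_second = 4.5e9 / (24 * 3600)
print(round(tests_per_second))     # about 52,000 tests per second
```

Restricting one side of the comparison to a small set (for example, 100 SNPs against all others) reduces the count from billions to roughly ten million.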

All tests described above are based on single SNP tests. It is also
possible to impute haplotypes based on multimarker predictors using
the standard EM algorithm and then either perform the test on the
posterior distribution of haplotypes given genotypes for each
individual (for main disease and quantitative trait population-based
association tests only) or, alternatively, create a new file with the
most likely haplotype pair imputed (or set to missing if the
probability of the most likely phase is below a certain value).
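
To make the EM step concrete, here is a minimal sketch of the textbook EM algorithm for estimating haplotype frequencies at two biallelic SNPs from unphased genotypes. It is a generic illustration under assumed inputs (genotypes coded as allele counts, no missing data, a fixed iteration count) and does not reproduce PLINK's multimarker implementation.

```python
def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate two-SNP haplotype frequencies by EM.

    genotypes: list of (g1, g2), each 0, 1 or 2 copies of allele '1'
    at SNP 1 and SNP 2. Returns {(a1, a2): frequency}.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}          # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            # E-step: ordered haplotype pairs compatible with this genotype
            pairs = [(h1, h2) for h1 in haps for h2 in haps
                     if h1[0] + h2[0] == g1 and h1[1] + h2[1] == g2]
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0:
                continue
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total     # posterior-weighted haplotype counts
                counts[h2] += w / total
        # M-step: renormalise expected counts to frequencies
        n = sum(counts.values())
        freq = {h: c / n for h, c in counts.items()}
    return freq


# Hypothetical sample: 5 individuals 00/00, 5 individuals 11/11, 1 double heterozygote.
sample = [(0, 0)] * 5 + [(2, 2)] * 5 + [(1, 1)]
print(em_haplotype_freqs(sample))
```

Only the double heterozygote is ambiguous in phase; the posterior weights computed in the E-step are what a haplotype-based test would use, and the pair with the largest weight is the "most likely haplotype pair" for that individual.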

