PLINK: Whole genome data analysis toolset

PLINK is a whole genome
association analysis toolset, designed to perform large-scale analyses
in a computationally efficient manner. By using a binary
representation of SNP data, very large datasets can be loaded into
memory in their entirety and processed very efficiently. For example,
calculating the chi-square statistics and odds ratios for all SNPs in
a 100K dataset containing 350 individuals took ~2 seconds on a
standard Linux workstation, making a permutation-based approach to
whole genome analysis feasible.
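As a rough illustration of the per-SNP computation being timed above, the following sketch computes a 1-df Pearson chi-square statistic and an odds ratio from a 2x2 table of allele counts. The function name `allelic_test` and the example counts are hypothetical, and this is not PLINK's own implementation.

```python
def allelic_test(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Return (chi_square, odds_ratio) for a 2x2 table of allele counts.

    Rows are cases/controls, columns are the two alleles of one SNP.
    """
    n = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    row_case = case_a1 + case_a2
    row_ctrl = ctrl_a1 + ctrl_a2
    col_a1 = case_a1 + ctrl_a1
    col_a2 = case_a2 + ctrl_a2

    # Pearson chi-square for a 2x2 table (1 df, no continuity correction).
    denom = row_case * row_ctrl * col_a1 * col_a2
    if denom == 0:
        chi_square = float("nan")  # a row or column is empty
    else:
        chi_square = n * (case_a1 * ctrl_a2 - case_a2 * ctrl_a1) ** 2 / denom

    # Odds ratio for allele 1 in cases relative to controls.
    cross = case_a2 * ctrl_a1
    odds_ratio = (case_a1 * ctrl_a2) / cross if cross else float("inf")
    return chi_square, odds_ratio


# Hypothetical counts: 30/70 alleles in cases, 20/80 in controls.
chi_square, odds_ratio = allelic_test(30, 70, 20, 80)
```

Because the statistic depends only on four counts per SNP, it is cheap to recompute after each relabelling of cases and controls, which is what makes permutation testing practical.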
PLINK will generate a
number of standard summary statistics that are useful for quality
control and can be used as thresholds for subsequent analyses
(e.g. missing genotype rate, minor allele frequency, Hardy-Weinberg
equilibrium failures and non-Mendelian transmission rates) as well as
some more novel ones (estimated inbreeding coefficients for each
individual and genome-wide identity-by-state and identity-by-descent
estimates for all pairs of individuals). The latter can be used to
detect sample contaminations, swaps and duplications as well as
pedigree errors and unknown familial relationships (e.g. sibling pairs
in a case/control population-based sample). PLINK also provides a simple
interface for recoding, reordering, flipping DNA strand and extracting
subsets of SNP data.
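The quality-control summaries listed above can be sketched for a single SNP as follows. This is a minimal illustration under stated assumptions: genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call, the name `snp_summary` is hypothetical, and the Hardy-Weinberg check uses a simple chi-square goodness-of-fit statistic, which may differ from the test PLINK actually applies.

```python
def snp_summary(genotypes):
    """Return (missing_rate, minor_allele_freq, hwe_chi_square) for one SNP.

    genotypes: list of 0/1/2 allele counts, with None for missing calls.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)

    n = len(called)
    if n == 0:
        return missing_rate, float("nan"), float("nan")

    # Frequency of the counted allele, then fold to the minor allele.
    p = sum(called) / (2 * n)
    q = 1 - p
    maf = min(p, q)

    # Observed vs. expected genotype counts under Hardy-Weinberg equilibrium.
    observed = [called.count(k) for k in (0, 1, 2)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    hwe_chi_square = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    return missing_rate, maf, hwe_chi_square


# Hypothetical genotypes for five individuals, one of them missing.
missing_rate, maf, hwe_chi_square = snp_summary([0, 1, 1, 2, None])
```

Each value can then be compared against a user-chosen threshold to decide whether the SNP is kept for later analyses.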
PLINK includes a simple but powerful approach to population
stratification that can use whole genome SNP data in a computationally
efficient manner. We use complete linkage agglomerative clustering, based on
pairwise IBS distance, but with some modifications to the clustering
process: restrictions based on a significance test for whether two
individuals belong to the same population (i.e. do not merge clusters
that contain significantly different individuals), a phenotype
criterion (i.e. all pairs must contain at least one case and one
control) and a cluster size restriction (i.e. such that, with a
cluster size of 2, for example, the subsequent association test would
implicitly match every case with its nearest control, as long as the
case and control do not show evidence of belonging to different
populations). Any evidence of population substructure (from this or
any other analysis) can be incorporated in subsequent association
tests via the specification of clusters, as at each permutation step
of the tests described below, individuals are only permuted within
cluster.
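The within-cluster permutation described above can be sketched as follows: phenotype labels are shuffled only among individuals that share a cluster, so each cluster keeps its own case/control composition. The function name `permute_within_clusters` and the example data are hypothetical, and the clustering itself is assumed to have been done already.

```python
import random


def permute_within_clusters(phenotypes, clusters, rng):
    """Return a copy of `phenotypes` shuffled separately within each cluster.

    phenotypes: list of labels (e.g. 1 = case, 0 = control).
    clusters:   list of cluster identifiers, one per individual.
    rng:        a random.Random instance (passed in for reproducibility).
    """
    # Group individual indices by cluster.
    members = {}
    for index, cluster in enumerate(clusters):
        members.setdefault(cluster, []).append(index)

    permuted = list(phenotypes)
    for indices in members.values():
        labels = [phenotypes[i] for i in indices]
        rng.shuffle(labels)  # shuffle labels only among cluster members
        for i, label in zip(indices, labels):
            permuted[i] = label
    return permuted


# Hypothetical sample: six individuals in two clusters.
phenotypes = [1, 0, 0, 1, 1, 0]
clusters = ["A", "A", "A", "B", "B", "B"]
shuffled = permute_within_clusters(phenotypes, clusters, random.Random(1))
```

Recomputing the association statistic on many such shuffles yields an empirical null distribution that is conditioned on cluster membership.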
The basic association test is for a disease trait and is based on
comparing allele frequencies between cases and controls (asymptotic
and empirical p-values are available). Also implemented are the
Cochran-Armitage trend test, dominant and recessive models and a two
degree of freedom general model. Different models can be compared by
likelihood ratio test, and the significance of the most significant
model can be evaluated by permutation. We also
test for differences in missing genotype rates between cases and
controls.
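For reference, the Cochran-Armitage trend test mentioned above can be written in a few lines from genotype counts. This is a sketch using the standard additive weights (0, 1, 2). The function name `trend_test` and the counts are hypothetical, and the code is not taken from PLINK.

```python
def trend_test(case_counts, ctrl_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) from genotype counts.

    case_counts, ctrl_counts: counts of the three genotypes in cases and
    controls, ordered by number of copies of the tested allele.
    """
    r = sum(case_counts)  # number of cases
    s = sum(ctrl_counts)  # number of controls
    n = r + s
    totals = [a + b for a, b in zip(case_counts, ctrl_counts)]

    sum_w_case = sum(w * a for w, a in zip(weights, case_counts))
    sum_w_all = sum(w * t for w, t in zip(weights, totals))
    sum_w2_all = sum(w * w * t for w, t in zip(weights, totals))

    # Trend statistic T and its variance under the null hypothesis.
    t = n * sum_w_case - r * sum_w_all
    spread = n * sum_w2_all - sum_w_all ** 2
    if r == 0 or s == 0 or spread == 0:
        return float("nan")
    return n * t ** 2 / (r * s * spread)


# Hypothetical genotype counts (0, 1, 2 copies of the tested allele).
chi_square = trend_test((10, 20, 20), (20, 20, 10))
```

Unlike the allelic test, this statistic works on genotype counts and does not assume Hardy-Weinberg equilibrium.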
Family-based data can be analyzed with the basic TDT, using either
asymptotic or empirical significance values. In addition, the basic
test is modified to include information on parental phenotypes to give
a more powerful combined test. The permutation procedure flips
transmitted/untransmitted status consistently across all SNPs for a given
family, thereby preserving the LD and linkage information between
markers and siblings.
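A minimal sketch of the basic TDT statistic and of the family-level flip used in permutation is given below. It assumes that, for each SNP, a family is summarised by a pair of counts (transmitted, untransmitted) of the tested allele from heterozygous parents. The names `tdt_chi_square` and `flip_family` are hypothetical, and the combined parental-phenotype test is not covered.

```python
import random


def tdt_chi_square(transmitted, untransmitted):
    """Basic TDT statistic (1 df): (b - c)^2 / (b + c)."""
    total = transmitted + untransmitted
    if total == 0:
        return float("nan")  # no informative transmissions
    return (transmitted - untransmitted) ** 2 / total


def flip_family(per_snp_counts, rng):
    """Swap transmitted/untransmitted counts for ALL SNPs of one family.

    A single coin flip decides the swap, so the pattern across SNPs
    within the family is left intact.
    """
    if rng.random() < 0.5:
        return [(c, b) for b, c in per_snp_counts]
    return list(per_snp_counts)


# Hypothetical totals over all families for one SNP.
chi_square = tdt_chi_square(60, 40)

# Hypothetical (transmitted, untransmitted) counts for one family at 3 SNPs.
flipped = flip_family([(1, 0), (0, 1), (1, 1)], random.Random(0))
```

Applying one flip per family, rather than one per SNP, is what keeps the correlation between nearby markers the same in permuted and observed data.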
Quantitative traits can also be tested for association, using either
asymptotic (likelihood ratio test and Wald test) or empirical
significance values. As above, any clustering scheme can be specified
for the permutations: in this way, it is possible to control for
subpopulation membership as well as family structure, if related
individuals are being analyzed.
For all tests (disease, quantitative and family-based) it is possible
to specify 'sets' of SNPs to calculate 'set-based' or 'gene-based'
test statistics: these are based on cumulative sums of rank-ordered
single SNP statistics, and are evaluated by permutation.
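The set-based statistics described above can be sketched as cumulative sums over the single-SNP statistics of a set, sorted from largest to smallest. The function names and the empirical p-value convention (adding one to numerator and denominator) are assumptions made for illustration. The text does not specify how PLINK computes them.

```python
from itertools import accumulate


def cumulative_set_statistics(snp_statistics):
    """Cumulative sums of single-SNP statistics, ranked largest first.

    Element k-1 of the result is the sum of the k best SNPs in the set.
    """
    ranked = sorted(snp_statistics, reverse=True)
    return list(accumulate(ranked))


def empirical_p(observed, permuted):
    """Share of permuted statistics at least as large as the observed one.

    Uses the (count + 1) / (permutations + 1) convention (an assumption).
    """
    hits = sum(1 for value in permuted if value >= observed)
    return (hits + 1) / (len(permuted) + 1)


# Hypothetical chi-square values for three SNPs in one gene.
set_stats = cumulative_set_statistics([1.0, 4.0, 2.5])

# Hypothetical permuted values of one set statistic.
p_value = empirical_p(5.0, [1.0, 6.0, 2.0])
```

In a permutation run, the same cumulative sums would be recomputed for each permuted dataset and compared with the observed values position by position.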
For the disease-trait population-based samples, it is possible to test
for epistasis. The epistasis test can either be case-only or
case-control. Either all pairwise combinations of SNPs can be tested
(although this is most likely not desirable, it is computationally
feasible using PLINK -- the 4.5 billion two-locus tests generated from
a 100K data set took just over 24 hours to run) or sets can be
specified (e.g. to test only the most significant 100 SNPs against all
other SNPs, or against themselves, etc). The output consists only of
pairwise epistatic results above a certain significance value; also,
for each SNP, a summary of all the pairwise epistatic tests is given
(e.g. maximum test, proportion of tests significant at a certain
threshold, etc). A similar methodology allows for testing of
gene-environment interaction (for dichotomous environmental
variables).
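To give a sense of scale for the all-pairs and set-restricted designs above, the number of two-locus tests can be counted directly. The exact number of SNPs analysed is not stated in the text, so the figures below use round, hypothetical values, and the function names are illustrative only.

```python
from math import comb


def n_pairwise_tests(n_snps):
    """Number of unordered SNP pairs: n * (n - 1) / 2."""
    return comb(n_snps, 2)


def n_subset_vs_all_tests(n_top, n_snps):
    """Pairs with at least one SNP in a chosen subset of size n_top.

    Counts pairs within the subset plus subset-by-remainder pairs.
    """
    return comb(n_top, 2) + n_top * (n_snps - n_top)


# Hypothetical round numbers: 100,000 SNPs, top 100 SNPs as the subset.
all_pairs = n_pairwise_tests(100_000)
top_vs_all = n_subset_vs_all_tests(100, 100_000)
```

Restricting one side of each pair to a small subset reduces the work from billions of tests to millions, which is why set specification is usually preferable to an exhaustive scan.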
All tests described above are based on single SNP tests. It is also
possible to impute haplotypes based on multimarker predictors using
the standard E-M algorithm and then either perform the test on the
posterior distribution of haplotypes given genotypes for each
individual (for main disease and quantitative trait population-based
association tests only) or, alternatively, create a new file with the
most likely haplotype pair imputed (or set to missing if the
probability of the most likely phase is below a certain value).