PLINK: Whole genome data analysis toolset
Population stratification
PLINK includes a simple but powerful approach to population
stratification that can use whole genome SNP data in a
computationally efficient manner. We use complete linkage
agglomerative clustering, based on pairwise IBS distance, but with
some modifications to the clustering process: a restriction based on a
significance test for whether two individuals belong to the same
population (i.e. do not merge clusters that contain significantly
different individuals), a phenotype criterion (i.e. every cluster must
contain at least one case and one control) and a cluster size
restriction (e.g. with a maximum cluster size of 2, the subsequent
association test would implicitly match every case with its nearest
control, as long as the case and control do not show evidence of
belonging to different populations). Any evidence of population
substructure (from this or any other analysis) can be incorporated in
subsequent association tests by specifying clusters: at each
permutation step of the tests described below, individuals are only
permuted within their cluster.
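The within-cluster permutation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PLINK's implementation; the function name permute_within_clusters, the toy individuals and the "case"/"control" labels are all hypothetical.

```python
import random
from collections import defaultdict

def permute_within_clusters(phenotypes, clusters, rng):
    """Return a copy of `phenotypes` with labels shuffled inside each cluster only.

    phenotypes: dict of individual ID -> phenotype label
    clusters:   dict of individual ID -> cluster ID
    rng:        a random.Random instance (passed in so runs are reproducible)
    """
    members = defaultdict(list)
    for ind in sorted(phenotypes):
        members[clusters[ind]].append(ind)

    permuted = {}
    for inds in members.values():
        labels = [phenotypes[i] for i in inds]
        rng.shuffle(labels)  # labels never leave their own cluster
        for ind, label in zip(inds, labels):
            permuted[ind] = label
    return permuted

# Hypothetical toy data: three clusters of two individuals each.
phenotypes = {"A": "case", "B": "control", "C": "case",
              "D": "control", "E": "control", "F": "control"}
clusters = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2}

permuted = permute_within_clusters(phenotypes, clusters, random.Random(1))
```

Because labels are only exchanged between members of the same cluster, the number of cases and controls in each cluster is the same in every permuted data set.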
All these analyses require a large number of SNPs!
IBS distance matrix
To create a matrix of pairwise IBS distances:
plink --file mydata --matrix
creates two files
plink.mdist
plink.pdist
which contain the IBS distances and p-value tests for all pairs of
individuals.
HINT See the FAQ
page for instructions on using R to visualise these results.
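To make the idea of a pairwise IBS distance concrete, here is a minimal Python sketch. It assumes genotypes coded as 0, 1 or 2 copies of one allele, with None for a missing call, and takes the distance to be 1 minus the proportion of alleles shared IBS over the SNPs genotyped in both individuals. The exact definition used for plink.mdist is not stated on this page, so treat the formula, the function names and the toy data as assumptions.

```python
def ibs_distance(g1, g2):
    """1 minus the proportion of alleles shared IBS, skipping missing calls (None)."""
    shared = 0
    total = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared IBS at this SNP
        total += 2
    if total == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return 1.0 - shared / total

def distance_matrix(genotypes):
    """Return sorted IDs and the square matrix of pairwise distances."""
    ids = sorted(genotypes)
    matrix = [[ibs_distance(genotypes[i], genotypes[j]) for j in ids] for i in ids]
    return ids, matrix

# Hypothetical toy genotypes for three individuals at four SNPs.
genotypes = {
    "A": [0, 1, 2, 0],
    "B": [0, 1, 2, 1],
    "C": [2, 1, 0, None],
}
ids, dist = distance_matrix(genotypes)
```

With only four SNPs the distances are very coarse, which is one way to see why these analyses need a large number of SNPs.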
IBS clustering
To perform the clustering:
plink --file mydata --cluster
The output is sent to four files (cluster0, cluster1, cluster2 and
cluster3) that contain similar information but in different formats.
The cluster0 file contains information on the merging process.
The cluster1 file contains information on the final solution, listed by cluster: e.g. for 4 individuals in 3 clusters:
0 A
1 B C
2 D
The cluster2 file contains the same information but listed by individual:
A 0
B 1
C 1
D 2
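The two layouts carry the same information, so one can be rebuilt from the other. The sketch below regroups by-individual lines into the by-cluster layout, using the simplified two-column example shown here; real output files may carry additional ID columns, and the function name group_by_cluster is hypothetical.

```python
from collections import defaultdict

def group_by_cluster(lines):
    """Turn 'individual cluster' lines into {cluster: [individuals]}."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        individual, cluster = fields[0], int(fields[1])
        clusters[cluster].append(individual)
    return dict(clusters)

# The by-individual example from the text above.
by_individual = ["A 0", "B 1", "C 1", "D 2"]
by_cluster = group_by_cluster(by_individual)

for cluster in sorted(by_cluster):
    print(cluster, " ".join(by_cluster[cluster]))
```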
The cluster3 file is in the same format as cluster2 but contains all solutions, one per column: e.g.
A 0 0 0 0
B 1 1 1 0
C 2 1 1 0
D 3 2 0 0
i.e. reading from left to right, we start with N clusters of size 1; moving right, we end with one cluster of size N.
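As a quick check on how to read this layout, the following Python sketch counts the distinct cluster labels in each solution column of the example above. It assumes the simplified layout shown here (one ID column followed by one column per solution); the function name clusters_per_solution is hypothetical.

```python
def clusters_per_solution(lines):
    """Count the distinct cluster labels in each solution column."""
    rows = [line.split()[1:] for line in lines if line.strip()]
    # zip(*rows) transposes the rows so that each tuple is one solution.
    return [len(set(column)) for column in zip(*rows)]

# The all-solutions example from the text above.
all_solutions = ["A 0 0 0 0", "B 1 1 1 0", "C 2 1 1 0", "D 3 2 0 0"]
counts = clusters_per_solution(all_solutions)
print(counts)  # [4, 3, 2, 1]
```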
Constraints on clustering
Extra constraints can be placed on the clustering through options
that accompany the --cluster option. (Note:
--matrix and --cluster can be performed at the same
time.)
To merge only clusters that do not contain individuals who differ
significantly at a given p-value threshold:
--merge 0.0001
To ensure that every cluster has at least one case and one control:
--cc
To set the maximum cluster size to a certain value, e.g. 2:
--mc 2
Putting these together with some of the association options
will probably be the most common usage of this cluster option. Some
of these combinations are illustrated in the association
analysis section.
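To illustrate how such constraints shape the result, here is a minimal Python sketch of complete linkage agglomerative clustering with a maximum cluster size and a set of pairs that may not be merged. It is not PLINK's implementation: the forbidden set is a stand-in for pairs that would fail the significance test, the case/control criterion is not modelled, and the function name, parameters and toy distances are all hypothetical.

```python
def complete_linkage(dist, max_size=None, forbidden=frozenset()):
    """Greedy complete linkage clustering over a square distance matrix.

    dist:      square list of lists of pairwise distances
    max_size:  largest cluster allowed (None for no limit)
    forbidden: set of frozenset({i, j}) index pairs that must not share a cluster
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(dist[i][j] for i in a for j in b)

    def allowed(a, b):
        if max_size is not None and len(a) + len(b) > max_size:
            return False
        return not any(frozenset((i, j)) in forbidden for i in a for j in b)

    while True:
        candidates = [
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y and allowed(a, b)
        ]
        if not candidates:
            return clusters  # no permitted merge is left
        _, x, y = min(candidates)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

# Hypothetical distances between four individuals (indices 0 to 3).
dist = [
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
pairs = complete_linkage(dist, max_size=2)
```

With max_size=2 the merging stops once every individual has been paired with its nearest permitted neighbour, which mirrors the matching behaviour described for a cluster size of 2. Without any constraint, the same function keeps merging until a single cluster remains.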
WARNING! The calculation of p-values for all pairs
assumes that all SNPs are uncorrelated, i.e. not too close together. (The
basic IBS clustering does not make this assumption.) Therefore, if the
p-value constraint is used, it should only be applied to a subset of
independent SNPs.
HINT Also, this test is susceptible to non-random missingness
in genotypes, particularly if heterozygotes are more likely to be dropped.
It is therefore good practice to set the --geno threshold very high for
this analysis, so that only SNPs with virtually complete genotyping are
included.