PLINK: Whole genome data analysis toolset
Population stratification
PLINK includes a simple but powerful approach to population stratification that can use whole genome SNP data in a computationally efficient manner. We use complete linkage agglomerative clustering, based on pairwise IBS distance, with some modifications to the clustering process: a restriction based on a significance test for whether two individuals belong to the same population (i.e. do not merge clusters that contain significantly different individuals), a phenotype criterion (i.e. every cluster must contain at least one case and one control) and a cluster size restriction (e.g. with a maximum cluster size of 2, the subsequent association test would implicitly match every case with its nearest control, as long as the case and control do not show evidence of belonging to different populations). Any evidence of population substructure (from this or any other analysis) can be incorporated in subsequent association tests by specifying clusters: at each permutation step of the tests described below, individuals are permuted only within cluster.
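
As a rough illustration of the clustering idea (not PLINK's actual implementation), the following Python sketch performs complete linkage agglomerative clustering on a distance matrix, with two of the restrictions described above. The function name and the parameters max_size and blocked are hypothetical: max_size caps the cluster size, and blocked is a set of pairs that may never share a cluster, standing in for pairs that a significance test has judged to be different. The phenotype criterion is not modelled here.

```python
def complete_linkage(dist, max_size=None, blocked=None):
    """Greedy complete-linkage clustering on a symmetric distance matrix.

    dist:     list of lists of pairwise distances between individuals.
    max_size: optional cap on cluster size (hypothetical parameter).
    blocked:  optional set of frozenset({i, j}) pairs that must never
              end up in the same cluster (hypothetical parameter).
    Returns the clusters (lists of indices) once no merge is permitted.
    """
    blocked = blocked or set()
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: the distance between two clusters is the
        # largest distance between any member of a and any member of b.
        return max(dist[i][j] for i in a for j in b)

    def allowed(a, b):
        if max_size is not None and len(a) + len(b) > max_size:
            return False
        return not any(frozenset((i, j)) in blocked for i in a for j in b)

    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                a, b = clusters[x], clusters[y]
                if not allowed(a, b):
                    continue
                d = linkage(a, b)
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best is None:
            # No permitted merge remains.
            return clusters
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]


# Made-up distances for four individuals.
dist = [
    [0.00, 0.10, 0.50, 0.60],
    [0.10, 0.00, 0.55, 0.65],
    [0.50, 0.55, 0.00, 0.20],
    [0.60, 0.65, 0.20, 0.00],
]
print(complete_linkage(dist, max_size=2))
```

With max_size=2 the closest pairs are matched and merging then stops, which mirrors the pairwise matching described for a cluster size of 2.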

All these analyses require a large number of SNPs!

IBS distance matrix

To create a matrix of pairwise IBS distances:

plink --file mydata --matrix

This creates two files, which contain the IBS distances and p-value tests for all pairs of individuals.
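
To make the notion of a pairwise IBS distance concrete, here is a minimal Python sketch. It assumes genotypes are coded as allele counts (0, 1 or 2, with None for a missing call) and defines the distance as one minus the average proportion of alleles shared identical-by-state; both the coding and the exact scaling are assumptions for illustration and may differ from what PLINK writes to its output files.

```python
def ibs_distance(g1, g2):
    """IBS distance between two individuals.

    g1, g2: sequences of allele counts (0, 1 or 2), None if missing.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # use only SNPs genotyped in both individuals
        # Alleles shared IBS at one SNP: 2 minus the count difference.
        shared += 2 - abs(a - b)
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return 1.0 - shared / (2.0 * compared)


def ibs_distance_matrix(genotypes):
    """Symmetric matrix of IBS distances for all pairs of individuals."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = ibs_distance(genotypes[i], genotypes[j])
    return dist


# Three made-up individuals typed at four SNPs.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, None],
]
matrix = ibs_distance_matrix(genotypes)
```

Identical genotype vectors give a distance of 0; in practice a large number of SNPs is needed for these values to be informative.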

HINT See the FAQ page for instructions on using R to visualise these results.

IBS clustering

To perform the clustering:

plink --file mydata --cluster

The output is sent to four files that contain similar information but in different formats.

The cluster0 file contains information on the merging process.

The cluster1 file contains information on the final solution, listed by cluster: e.g. for 4 individuals in 3 clusters:
     0   A
     1   B C
     2   D

The cluster2 file contains the same information but listed by individual:
     A 0
     B 1
     C 1
     D 2
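
The two listings carry the same information, which a short Python sketch can show by regrouping per-individual lines into per-cluster lines. It assumes the simple two-column layout of the example above (an individual label followed by a cluster number); real output files may carry additional identifier columns.

```python
def group_by_cluster(lines):
    """Turn 'individual cluster' lines into {cluster: [individuals]}."""
    clusters = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        individual, cluster = fields[0], int(fields[1])
        clusters.setdefault(cluster, []).append(individual)
    return clusters


by_individual = ["A 0", "B 1", "C 1", "D 2"]
for cluster, members in sorted(group_by_cluster(by_individual).items()):
    print(cluster, " ".join(members))
```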

The cluster3 file is in the same format as cluster2 but contains all solutions, one per column: e.g.
     A 0 0 0 0
     B 1 1 1 0
     C 2 1 1 0
     D 3 2 0 0
i.e. reading from left to right, we start with N clusters of size 1 and end with 1 cluster of size N.
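
One way to check this reading of the all-solutions layout is to count the distinct cluster labels in each column. The Python sketch below does this for the example above, again assuming one label column followed by one column per solution.

```python
def clusters_per_solution(lines):
    """Count the distinct clusters in each solution column."""
    rows = [line.split() for line in lines if line.strip()]
    n_solutions = len(rows[0]) - 1  # first field is the individual label
    counts = []
    for col in range(1, n_solutions + 1):
        labels = {row[col] for row in rows}
        counts.append(len(labels))
    return counts


all_solutions = ["A 0 0 0 0", "B 1 1 1 0", "C 2 1 1 0", "D 3 2 0 0"]
print(clusters_per_solution(all_solutions))  # [4, 3, 2, 1]
```

The counts fall by one at each step because every merge joins exactly two clusters.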

Constraints on clustering

Extra constraints can be placed on the clustering by adding the following options alongside the --cluster option. (Note: --matrix and --cluster can be performed at the same time.)

To only merge clusters that do not contain individuals differing at a certain p-value:

--merge 0.0001

To ensure that every cluster has at least one case and one control:


To set the maximum cluster size to a certain value, e.g. 2:

--mc 2

Combining these constraints with some of the association options will probably be the most common usage of the cluster option. Some of these combinations are illustrated in the association analysis section.

WARNING! The calculation of p-values for all pairs assumes that all SNPs are uncorrelated, i.e. not too close together. (The basic IBS clustering does not make this assumption). Therefore, if this option is used, it should only be performed on a subset of independent SNPs.

HINT Also, this test is susceptible to non-random missingness in genotypes, particularly if heterozygotes are more likely to be dropped. It is therefore good practice to set the --geno threshold very high for this analysis, so that only SNPs with virtually complete genotyping are included.
