The central information for both clustering and multi-dimensional
scaling is the pairwise identity-by-state (IBS) information for each
pair of individuals. This can be calculated in advance with
the --genome command and stored in a plink.genome
file and subsequently accessed for any clustering or MDS analysis with
the --read-genome command.
If available, using a parallel compute cluster for this step is
advisable (see the --genome-lists command described on the
It is advisable to filter SNPs based on linkage disequilibrium prior
to calculating genome-wide IBS. That is, it is probably not desirable
to have many SNPs from the same region, in tight LD, all contributing
to the IBS scores. At the extreme, such regions will influence the
clustering and MDS analyses (e.g. a dimension of the MDS could
represent local variation of a cluster of SNPs in LD).
SNPs should probably be filtered to include only high genotyping, high
confidence SNPs prior to this analysis also. For example, one might
apply filters (assuming that most other basic QC has already been
and then apply LD pruning:
--indep-pairwise 100 50 0.2
to get something in the order of 50,000 high-genotyping SNPs in
approximate linkage equilibrium to use in the genome/cluster analysis.
Then run the --genome command on this reduced set, to
calculate the plink.genome file.
Given the pre-calculated .genome file, the next step might be
to look for extreme outliers before running the main clustering/MDS
analysis. This option will detect pairs of individuals that are
appear much more similar (related/contaminated/duplicated) or
dissimilar compared to the rest of the sample:
Amongst others, the --cluster option produces the
file plink.cluster3 which we will use in the next step.
The --neighbour option produces the
file plink.nearest: the columns to focus on are Z
and NN==1 rows, looking for outliers (e.g. say more than 3
standard deviations). Plotting the results usually helps here. At this
stage, running the
procedure to calculate heterozygosity coefficients is also helpful,
similarly to look for outliers.
Any obvious outliers should most probably be removed from analysis at
And look at plink.mds file -- identify further outliers / any
evidence for 1 major cluster. Remove outliers; if major clusters,
lookup in previously calculated plink.cluster3 file (i.e. in R
d <- read.table("plink.mds",header=T)
k <- read.table("plink.cluster3")
Plot the MDS, selecting the appropriate column of 'k' to supply the
cluster solution K=X of N individuals (i.e. last col of k has K=1
solution, next to last K=2, etc, up to column 3 has K=N solution; we
add 1 to the colour code so we don't plot white on white.
plot( d$C1 , d$C2 , col = k[ , N+3-X ]+1 )
The absolute values of each dimension do not have any useful
Also, one might check whether any distinct clusters outside the main
one map to a specific plate, batch, etc. at this stage.
plot( d$C1 )
plot( d$C2 )
2b) Optionally: take each component of the MDS and treat as a
phenotype in a quantitative trait using all SNPs -- calculate lambda;
plot results, to see if a dimension represents a particular location
(i.e. lambda will be near 1). Helps determine whether that MDS
could/should be used as a covariate downstream.
3) Optionally: determine whether cases and controls are, on average,
comparably matched (via permutation: are phenotypically discordant
pairs less similar in IBS terms, on average?)
4) Run main clustering -- in this scenario with
seem reasonable parameters. Setting ppc higher means more but smaller
clusters; setting ppc lower means fewer but larger clusters.
for main analysis.
This document last modified