1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Familybased association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Metaanalysis
21. Annotation
22. LDbased results clumping
23. Genebased report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. Rplugins
28. Annotation weblookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flowchart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK


Population stratification: notes of use and interpretation
Step 1. Precalculate the .genome file
The central information for both clustering and multidimensional
scaling is the pairwise identitybystate (IBS) information for each
pair of individuals. This can be calculated in advance with
the genome command and stored in a plink.genome
file and subsequently accessed for any clustering or MDS analysis with
the readgenome command.
If available, using a parallel compute cluster for this step is
advisable (see the genomelists command described on the
main page).
It is advisable to filter SNPs based on linkage disequilibrium prior
to calculating genomewide IBS. That is, it is probably not desirable
to have many SNPs from the same region, in tight LD, all contributing
to the IBS scores. At the extreme, such regions will influence the
clustering and MDS analyses (e.g. a dimension of the MDS could
represent local variation of a cluster of SNPs in LD).
SNPs should probably be filtered to include only high genotyping, high
confidence SNPs prior to this analysis also. For example, one might
apply filters (assuming that most other basic QC has already been
done) of
maf 0.01
geno 0.01
mind 1
and then apply LD pruning:
indeppairwise 100 50 0.2
to get something in the order of 50,000 highgenotyping SNPs in
approximate linkage equilibrium to use in the genome/cluster analysis.
Then run the genome command on this reduced set, to
calculate the plink.genome file.
Step 2. Look for outliers and obtain full cluster solution
Given the precalculated .genome file, the next step might be
to look for extreme outliers before running the main clustering/MDS
analysis. This option will detect pairs of individuals that are
appear much more similar (related/contaminated/duplicated) or
dissimilar compared to the rest of the sample:
readgenome plink.genome
cluster
neighbour 1 5
Amongst others, the cluster option produces the
file plink.cluster3 which we will use in the next step.
The neighbour option produces the
file plink.nearest: the columns to focus on are Z
and NN==1 rows, looking for outliers (e.g. say more than 3
standard deviations). Plotting the results usually helps here. At this
stage, running the
het
procedure to calculate heterozygosity coefficients is also helpful,
similarly to look for outliers.
Any obvious outliers should most probably be removed from analysis at
this stage.
Step 3. Visualise the data using MDS, potentially determining the number of clusters
cluster
mdsplot 6
And look at plink.mds file  identify further outliers / any
evidence for 1 major cluster. Remove outliers; if major clusters,
lookup in previously calculated plink.cluster3 file (i.e. in R
d < read.table("plink.mds",header=T)
k < read.table("plink.cluster3")
Plot the MDS, selecting the appropriate column of 'k' to supply the
cluster solution K=X of N individuals (i.e. last col of k has K=1
solution, next to last K=2, etc, up to column 3 has K=N solution; we
add 1 to the colour code so we don't plot white on white.
plot( d$C1 , d$C2 , col = k[ , N+3X ]+1 )
The absolute values of each dimension do not have any useful
interpretation.
Also, one might check whether any distinct clusters outside the main
one map to a specific plate, batch, etc. at this stage.
plot( d$C1 )
plot( d$C2 )
2b) Optionally: take each component of the MDS and treat as a
phenotype in a quantitative trait using all SNPs  calculate lambda;
plot results, to see if a dimension represents a particular location
(i.e. lambda will be near 1). Helps determine whether that MDS
could/should be used as a covariate downstream.
3) Optionally: determine whether cases and controls are, on average,
comparably matched (via permutation: are phenotypically discordant
pairs less similar in IBS terms, on average?)
readgenome plink.genome
ibstest
4) Run main clustering  in this scenario with
ppc 1e3
cc
seem reasonable parameters. Setting ppc higher means more but smaller
clusters; setting ppc lower means fewer but larger clusters.
5) Then
mh within
for main analysis.
g.

