Resources available for download
This page contains links to several freely-available resources, mostly
generated by other individuals. All these resources are provided "as
is", without any guarantees regarding their correctness or utility.
The Phase 2 HapMap as a PLINK fileset
The HapMap genotype data (the latest is release
23) are available here as PLINK binary filesets. The SNPs are
currently coded according to NCBI build 36 coordinates on the forward
strand. Several versions are available, including the entire dataset (a
single, very large fileset: you will need a computer with at least 2 GB
of RAM to load this file).
The filtered SNP set refers to a list of SNPs that have MAF
greater than 0.01 and genotyping rate greater than 0.95 in the 60 CEU
founders. This fileset is probably a good starting place for
imputation in samples of European descent. Filtered versions of the
other HapMap panels will be made available shortly.
|Entire HapMap (release 23, 270 individuals, 3.96 million SNPs)
|CEU (release 23, 90 individuals, 3.96 million SNPs)
|YRI (release 23, 90 individuals, 3.88 million SNPs)
|JPT+CHB (release 23, 90 individuals, 3.99 million SNPs)
|CEU founders (release 23, 60 individuals, filtered 2.3 million SNPs)
|YRI founders (release 23, 60 individuals, filtered 2.6 million SNPs)
|JPT+CHB founders (release 23, 90 individuals, filtered 2.2 million SNPs)
|Entire HapMap (release 22, 270 individuals, 3.96 million SNPs)
|CEU founders (release 22, 60 individuals, 3.96 million SNPs)
|CEU founders (release 22, 60 individuals, filtered 2.2 million SNPs)
|CEU founders (release 22, as above, files split by chromosome, 1-22 and X)
|Hapmap individuals with population information ( FID, IID, POP )
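As a rough sketch of how the population file and the filtering thresholds described above might be combined, the commands below build a keep-list for one panel and then apply founder, MAF and genotyping-rate filters. The sample rows, the file names (pop_example.txt, ceu.keep, hapmap) and the output name are placeholders invented for illustration, not the names of the downloadable files; --geno 0.05 is used here as an approximation of "genotyping rate greater than 0.95".

```shell
# Hypothetical sample file in the FID, IID, POP layout (IDs invented for illustration).
cat > pop_example.txt <<'EOF'
FAM1 IND1 CEU
FAM1 IND2 CEU
FAM2 IND3 YRI
FAM3 IND4 CHB
EOF

# Build a keep-list (FID, IID) for a single population.
awk '$3 == "CEU" { print $1, $2 }' pop_example.txt > ceu.keep
cat ceu.keep

# Run PLINK only if it is installed and the placeholder fileset "hapmap" exists.
if command -v plink >/dev/null 2>&1 && [ -f hapmap.bed ]; then
  plink --bfile hapmap --keep ceu.keep --filter-founders \
        --maf 0.01 --geno 0.05 --make-bed --out ceu_founders_filtered
fi
```

The same pattern should work for the other panels by changing the population code matched in the awk step.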
Teaching materials and example dataset
A tutorial can be downloaded from here; the material is similar to the
online tutorial but slightly more involved. As it currently stands, it
is designed to first use gPLINK to perform a set of basic
tests and QC procedures and then move to standard PLINK for
more in-depth analysis.
It is designed to work on a standard modern laptop computer or
equivalent desktop. It was written for version 1.02 of PLINK, but
should remain compatible with future releases.
Feel free to use, modify or distribute these files in any way
you wish, although giving me appropriate credit for the materials
would be appreciated.
|ZIP archive containing data
|ZIP archive containing teaching materials
The example.zip archive contains:
wgas1.ped Whole-genome SNP data example PED file
wgas1.map Corresponding MAP file
extra.ped Follow-up genotyping for a particular region
extra.map Corresponding MAP file
pop.cov Population membership variable
command-list.txt List of all commands for 2nd part of practical
The teaching.zip archive contains a PowerPoint and a Word file:
These two files cover the first and second half of the tutorial
respectively. The second document assumes the first half has already
been completed (but also contains some introductory remarks concerning
the data). I will probably update the Word document to also include
the early commands covered in the PowerPoint/gPLINK part (i.e. so that
the entire practical can be performed from the command line rather
than using gPLINK). The list of commands (command-list.txt) is
included so that people can cut-and-paste commands in, rather than type. If
using DOS, it is a good idea to first increase the window width (right click on
header on DOS window, Properties, Layout and increase buffer and window width to
around 120 characters).
Everything should be fairly self-explanatory after looking through the PowerPoint file
and Word document.
Multimarker test lists
These files, generated by Itsik Pe'er and others, facilitate the
'multi-marker predictor' approach to association testing, as described
in the manuscript:
Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D
& Daly MJ (2006) Evaluating and improving power in whole-genome
association studies using fixed marker sets. Nat Genet, 38(6): 605-6.
They are PLINK-formatted lists of multimarker tests selected for
Affymetrix 500K and Illumina whole genome products, based on
consideration of the CEU Phase 2 HapMap (at r-squared=0.8
threshold). One should download the appropriate file and run with
the --hap option (after ensuring that any strand issues have
been resolved).
Note These haplotypes are specified in terms
of the +ve (positive) strand relative to the HapMap. You might need to
reformat your data (using the --flip command, for instance) before
you can use these files.
Note These tables list all tags for every common HapMap
SNP, at the given r-squared threshold. The same haplotype may therefore
appear multiple times (i.e. if it tags more than 1 SNP).
Note These tables obviously assume that all tags are present in
the final, post-quality-control dataset: i.e. if certain SNPs have been removed,
it will be better to reselect the predictors -- that is, these lists should really
only be used as a first pass, for convenience.
In general, however, an easier and quite possibly better strategy is
to analyse the data within an imputation context instead, e.g. utilising
the proxy association procedures rather than using these fixed lists.
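For the fixed-list route, a minimal sketch of the strand step followed by the multimarker tests might look like the following. It assumes you have some two-column annotation giving each SNP and the strand of its alleles in your data; that file (strand_example.txt), its layout, and the names mydata, tests.hlist and multimarker are all hypothetical placeholders. The choice of --hap-assoc as the haplotype test is also only one possibility.

```shell
# Hypothetical strand annotation: SNP ID and the strand of the alleles in your data.
cat > strand_example.txt <<'EOF'
rs0001 +
rs0002 -
rs0003 -
rs0004 +
EOF

# List the SNPs reported on the negative strand; these are the ones to flip.
awk '$2 == "-" { print $1 }' strand_example.txt > flip.list
cat flip.list

# Run PLINK only if it is installed and the placeholder input files exist.
if command -v plink >/dev/null 2>&1 && [ -f mydata.bed ] && [ -f tests.hlist ]; then
  # Flip the listed SNPs so that alleles are on the positive strand.
  plink --bfile mydata --flip flip.list --make-bed --out mydata_fwd
  # Test the multimarker predictors listed in the downloaded file.
  plink --bfile mydata_fwd --hap tests.hlist --hap-assoc --out multimarker
fi
```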
NOTE The gene range lists below have replaced this old gene SET file:
you are advised to use the lists below rather than this file.
Here is a PLINK-format SET
file, containing a genome-wide set of genes (N=18272). The
co-ordinates are based on NCBI B36 assembly, dbSNP 126; a gene is
arbitrarily defined as including 50kb upstream and downstream.
Download (ZIP archive):
Gene range lists
These are gene lists: files containing lists of genes, based on either
hg17 or hg18 co-ordinates. The format is one gene per row, with
whitespace-separated fields:
Chromosome
Start position (bp)
Stop position (bp)
Gene name
These lists can be used with PLINK commands such as
--make-set, --range, --gene-list,
--cnv-intersect, --clump-range, etc.
These gene lists were downloaded from UCSC table browser for all
RefSeq genes on July 24th 2008. Overlapping isoforms of the same gene
were combined to form a single full length version of the gene.
Isoforms that didn't overlap were left as duplicates of that gene.
Rather than using the gene sets
(described above), we suggest using these
gene lists to make gene sets on the fly
(using --make-set-border, if so desired, to add a fixed kb
border around each gene).
Gene list (hg18): glist-hg18
Gene list (hg17): glist-hg17
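To illustrate what adding a border to a gene range amounts to, here is a small sketch. It assumes a range file with four whitespace-separated columns (chromosome, start, stop, gene name); the genes, coordinates and the file names glist_example.txt, mydata and mygenes are invented placeholders. The awk step only mimics the effect of a border for inspection; in practice PLINK applies the border itself.

```shell
# Hypothetical gene range list: chromosome, start (bp), stop (bp), gene name.
cat > glist_example.txt <<'EOF'
1 15000 60000 GENEA
1 250000 300000 GENEB
2 5000 9000 GENEC
EOF

# Widen each range by a fixed border (in kb), never going below position 1.
BORDER_KB=20
awk -v kb="$BORDER_KB" '{
  start = $2 - kb * 1000; if (start < 1) start = 1
  stop  = $3 + kb * 1000
  print $1, start, stop, $4
}' glist_example.txt > glist_example.border.txt
cat glist_example.border.txt

# Let PLINK build the sets on the fly, if it and the placeholder files are present.
if command -v plink >/dev/null 2>&1 && [ -f mydata.bed ] && [ -f glist-hg18 ]; then
  plink --bfile mydata --make-set glist-hg18 --make-set-border "$BORDER_KB" \
        --write-set --out mygenes
fi
```

The border size of 20 kb is arbitrary here; choose whatever fits your definition of a gene region.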
Functional SNP attributes
This file contains a list of codes to indicate the functional status of SNPs. It is designed to be
used in conjunction with the --annotate command.
This file was created as follows: we downloaded all data from dbSNP,
build 129, and extracted lists of SNPs that are nonsense, frameshift,
missense or splice-site variants. We intersected this list with the
SNPs available in the Phase 2 CEU HapMap dataset, and selected lists
of SNPs that strongly tagged these functional SNPs (r-sq above 0.5; MAF
above 0.01). For each HapMap SNP that either is or tags a functional
SNP, we created an entry in the file below. Here upper-case indicates
that the SNP is itself a coding SNP in HapMap; lower-case indicates that
the SNP is in strong LD with a coding variant in HapMap.
In future, we will post revised attribute files that include more
annotations and information (e.g. a version with the rs ID of
the functional SNP(s) that each entry tags).
SNP attributes: snp129.attrib.gz
To use the file with the --annotate command, for example:
plink --annotate myresults.txt attrib=snp129.attrib.gz
(You can use gunzip, or WinZip, to decompress this file.)