This page contains links to several freely-available resources, mostly generated by other individuals. All these resources are provided "as is", without any guarantees regarding their correctness or utility.

The Phase 2 HapMap as a PLINK fileset

The HapMap genotype data (the latest is release 23) are available here as PLINK binary filesets. The SNPs are currently coded according to NCBI build 36 coordinates on the forward strand. Several versions are available: the entire dataset (a single, very large fileset; you will need a computer with at least 2 GB of RAM to load it), one fileset per population panel, and filtered founder-only filesets.

The filtered SNP set refers to a list of SNPs that have MAF greater than 0.01 and genotyping rate greater than 0.95 in the 60 CEU founders. This fileset is probably a good starting place for imputation in samples of European descent. Filtered versions of the other HapMap panels will be made available shortly.
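To make the filtering criteria concrete, here is a minimal Python sketch of the two thresholds applied to a single SNP. It is an illustration only, not the procedure used to build the filtered filesets: the function name `passes_filter` and the genotype encoding (count of one allele, 0/1/2, with `None` for a missing call) are assumptions, and the SNP is assumed to be autosomal. Within PLINK itself, comparable filters are applied with options such as --maf and --geno.

```python
def passes_filter(genotypes, maf_min=0.01, geno_rate_min=0.95):
    """Return True if a SNP passes the MAF and genotyping-rate thresholds.

    genotypes: one entry per founder, the count of a chosen allele
    (0, 1 or 2), or None for a missing call. Hypothetical encoding.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # no calls at all: cannot estimate a frequency
    geno_rate = len(called) / len(genotypes)
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    maf = min(freq, 1 - freq)               # minor allele frequency
    return maf > maf_min and geno_rate > geno_rate_min


# 60 founders, 3 heterozygotes, no missing calls: MAF = 3/120 = 0.025
common_snp = [1, 1, 1] + [0] * 57
# 60 founders, 1 heterozygote, 2 missing calls: MAF = 1/116, below 0.01
rare_snp = [1] + [0] * 57 + [None, None]
# 60 founders, 4 missing calls: genotyping rate = 56/60, below 0.95
poorly_typed_snp = [1, 1, 1] + [0] * 53 + [None] * 4

print(passes_filter(common_snp), passes_filter(rare_snp),
      passes_filter(poorly_typed_snp))
```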

Description	File size	File name
Entire HapMap (release 23, 270 individuals, 3.96 million SNPs)	120M	hapmap_r23a.zip
CEU (release 23, 90 individuals, 3.96 million SNPs)	59M	hapmap_CEU_r23a.zip
YRI (release 23, 90 individuals, 3.88 million SNPs)	65M	hapmap_YRI_r23a.zip
JPT+CHB (release 23, 90 individuals, 3.99 million SNPs)	58M	hapmap_JPT_CHB_r23a.zip
CEU founders (release 23, 60 individuals, filtered 2.3 million SNPs)	31M	hapmap_CEU_r23a_filtered.zip
YRI founders (release 23, 60 individuals, filtered 2.6 million SNPs)	38M	hapmap_YRI_r23a_filtered.zip
JPT+CHB founders (release 23, 90 individuals, filtered 2.2 million SNPs)	33M	hapmap_JPT_CHB_r23a_filtered.zip

Description	File size	File name
Entire HapMap (release 22, 270 individuals, 3.96 million SNPs)	110M	hapmap_r22.zip
CEU founders (release 22, 60 individuals, 3.96 million SNPs)	49M	hapmap-ceu-all.zip
CEU founders (release 22, 60 individuals, filtered 2.2 million SNPs)	29M	hapmap-ceu.zip
CEU founders (release 22, as above, files split by chromosome, 1-22 and X)	29M	hapmap-ceu-by-chr.zip

Description	File name
Hapmap individuals with population information ( FID, IID, POP )	hapmap.pop
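
As a rough illustration of how the population file might be used outside PLINK, the following Python sketch groups individuals by population label. It assumes only what is stated above (three whitespace-separated columns: FID, IID, POP); the function name `group_by_population` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def group_by_population(lines):
    """Map each POP label to the list of (FID, IID) pairs carrying it."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        fid, iid, pop = fields[:3]
        groups[pop].append((fid, iid))
    return dict(groups)


# Made-up rows in the FID, IID, POP layout (not real HapMap IDs)
sample_rows = [
    "FAM1 IND1 CEU",
    "FAM1 IND2 CEU",
    "FAM2 IND3 YRI",
    "",
]
groups = group_by_population(sample_rows)
for pop, members in sorted(groups.items()):
    print(pop, len(members))
```

To read the real file, the same function can be given an open file handle, e.g. `group_by_population(open("hapmap.pop"))`.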

Teaching materials and example dataset

A tutorial can be downloaded from here; the material is similar to the online tutorial but slightly more involved. As it currently stands, it is designed to first use gPLINK to perform a set of basic tests and QC procedures and then move to standard PLINK for more in-depth analysis.

It is designed to work on a standard modern laptop computer or equivalent desktop. It was written for version 1.02 of PLINK, but should remain compatible with future releases.

Description	File size	File name
ZIP archive containing data	15M	example.zip
ZIP archive containing teaching materials	1.3M	teaching.zip

You are free to use, modify or distribute these files in any way you wish, although giving me appropriate credit for the materials would be appreciated.

The example.zip archive contains

     wgas1.ped              Whole-genome SNP data example PED file
     wgas1.map              Corresponding MAP file
     extra.ped              Follow-up genotyping for a particular region
     extra.map              Corresponding MAP file
     pop.cov                Population membership variable
     command-list.txt       List of all commands for 2nd part of practical

The teaching.zip archive contains a PowerPoint and a Word file:

     practical-1-slides.ppt
     practical-2-notes.doc

These two files cover the first and second half of the tutorial respectively. The second document assumes the first half has already been completed (but also contains some introductory remarks concerning the data). I will probably update the Word document to also include the early commands covered in the PowerPoint/gPLINK part (i.e. so that the entire practical can be performed from the command line rather than using gPLINK). The list of commands (command-list.txt) is included so that people can cut-and-paste commands rather than type them. If using DOS, it is a good idea to first increase the window width (right-click on the header of the DOS window, select Properties, then Layout, and increase the buffer and window width to around 120 characters).

Everything should be fairly self-explanatory after looking through the PowerPoint file and Word document.

Multimarker test lists

These files, generated by Itsik Pe'er and others, facilitate the 'multi-marker predictor' approach to association testing, as described in the manuscript:

     Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D 
     & Daly MJ (2006) Evaluating and improving power in whole-genome 
     association studies using fixed marker sets. Nat Genet, 38(6): 663-7.

They are PLINK-formatted lists of multimarker tests selected for Affymetrix 500K and Illumina whole genome products, based on consideration of the CEU Phase 2 HapMap (at r-squared=0.8 threshold). One should download the appropriate file and run with the --hap option (after ensuring that any strand issues have been resolved).

Note These haplotypes are specified in terms of the +ve (positive) strand relative to the HapMap. You might need to reformat your data (using the --flip command, for instance) before you can use these files.

Note These tables list all tags for every common HapMap SNP, at the given r-squared threshold. The same haplotype may therefore appear multiple times (i.e. if it tags more than 1 SNP).

Note These tables assume that all tags are present in the final, post-quality-control dataset: if certain SNPs have been removed, it will be better to reselect the predictors. That is, these lists should really only be used as a first pass, for convenience.

In general, however, an easier and possibly better strategy is to analyse the data within an imputation context instead, e.g. utilising the proxy association procedures rather than using these fixed lists.

Gene sets

NOTE The gene range lists below have replaced this old gene SET file: you are advised to use the lists below rather than this file.

Here is a PLINK-format SET file, containing a genome-wide set of genes (N=18272). The co-ordinates are based on NCBI B36 assembly, dbSNP 126; a gene is arbitrarily defined as including 50kb upstream and downstream.

Download (ZIP archive): gene-list.zip

Gene range lists

These are gene lists: files containing lists of genes, based on either hg17 or hg18 co-ordinates. The format is one gene per row,

   Chromosome
   Start position (bp)
   Stop position (bp)
   Gene name

These lists can be used with PLINK commands such as --make-set, --range, --gene-list, --cnv-intersect, --clump-range, etc.

These gene lists were downloaded from UCSC table browser for all RefSeq genes on July 24th 2008. Overlapping isoforms of the same gene were combined to form a single full length version of the gene. Isoforms that didn't overlap were left as duplicates of that gene.

Rather than using the gene sets (described above), we suggest using these gene lists to make gene sets on the fly (using --make-set-border if so desired, to add a fixed kb border on the fly).
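
To illustrate what the range format and a fixed border imply, here is a minimal Python sketch that parses rows in the four-column layout described above and reports which genes contain a given position, optionally padded by a border in kb. It is not PLINK's implementation of --make-set or --make-set-border; the function names, the gene names and the coordinates are all made up for the example.

```python
def parse_gene_list(lines):
    """Parse rows of: chromosome, start (bp), stop (bp), gene name."""
    genes = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, start, stop, name = fields[:4]
        genes.append((chrom, int(start), int(stop), name))
    return genes


def genes_overlapping(genes, chrom, pos, border_kb=0):
    """Names of genes whose range, padded by border_kb on each side, contains pos."""
    border = border_kb * 1000
    return [name for c, start, stop, name in genes
            if c == chrom and start - border <= pos <= stop + border]


# Hypothetical rows, not taken from glist-hg18 or glist-hg17
rows = [
    "1 1000000 1050000 GENE_A",
    "1 1200000 1300000 GENE_B",
    "2 1000000 1050000 GENE_C",
]
genes = parse_gene_list(rows)
print(genes_overlapping(genes, "1", 1060000))                # outside GENE_A
print(genes_overlapping(genes, "1", 1060000, border_kb=20))  # inside the 20 kb border
```

Chromosome codes are compared as strings here, so the query must use the same coding as the file.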

Gene list (hg18): glist-hg18
Gene list (hg17): glist-hg17

Functional SNP attributes

This file contains a list of codes to indicate the functional status of SNPs. It is designed to be used in conjunction with the --annotate command.

This file was created as follows: we downloaded all data from dbSNP, build 129, and extracted lists of SNPs that are nonsense, frameshift, missense or splice-site variants. We intersected this list with the SNPs available in the Phase 2 CEU HapMap dataset, and selected lists of SNPs that strongly tagged these functional SNPs (r-sq above 0.5; MAF above 0.01). For each HapMap SNP that either is or tags a functional SNP, we created an entry in the file below. Here upper-case indicates that the SNP is itself a coding SNP in HapMap; lower-case indicates that the SNP is in strong LD with a coding variant in HapMap.

     =NONSENSE        =nonsense
     =MISSENSE        =missense
     =FRAMESHIFT      =frameshift
     =SPLICE          =splice
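
The upper-case/lower-case convention can be decoded mechanically. The Python sketch below is only an illustration of that convention for a single code string; the function name `describe_attribute` is hypothetical, and nothing here reads or assumes the layout of the attribute file itself.

```python
KNOWN_CLASSES = {"nonsense", "missense", "frameshift", "splice"}


def describe_attribute(code):
    """Return (functional class, relation) for a code such as '=MISSENSE'.

    relation is 'is' when the SNP itself is the functional variant
    (upper-case code) and 'tags' when it is in strong LD with one
    (lower-case code).
    """
    label = code.lstrip("=")
    if label.lower() not in KNOWN_CLASSES:
        raise ValueError("unrecognised attribute code: " + code)
    if label.isupper():
        relation = "is"
    elif label.islower():
        relation = "tags"
    else:
        raise ValueError("mixed-case attribute code: " + code)
    return label.lower(), relation


print(describe_attribute("=MISSENSE"))
print(describe_attribute("=splice"))
```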

In future, we will post revised attribute files that include more annotations and information (e.g. a version with the rs ID of the functional SNP(s) that is tagged).

SNP attributes: snp129.attrib.gz

To use the file with the --annotate command, for example:

    plink --annotate myresults.txt  attrib=snp129.attrib.gz

(You can use gunzip, or WinZip, to decompress this file.)

This document last modified Wednesday, 25-Jan-2017 11:39:28 EST