PLINK: Whole genome data analysis toolset

Basic usage / data formats

PLINK is a command-line program written in C/C++. Every command consists of typing plink at the command prompt (e.g. a DOS window or Unix terminal) followed by a number of options, each beginning with --, that specify the data files and methods to be used.

All results are written to files with various extensions. By default the file is named plink.ext, where .ext depends on the content of the file. These files will often be large, so using a package such as R is suggested for visualising and tabulating output. To that end, output files are in a standard plain-text 'rectangular' format, with one header row and a fixed number of columns per line. The reference section gives a complete list of all options and output file types.
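As an alternative to R, a few lines of Python can also load such a rectangular file. The sketch below assumes only what is stated above (white-space delimited text, one header row, a fixed number of columns per line). The function name and the sample column names are illustrative and are not actual PLINK output.

```python
def read_rectangular(lines):
    """Parse white-space delimited text with one header row into a list of dicts."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    table = []
    for fields in body:
        # Every data line is expected to have as many columns as the header.
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(fields)}")
        table.append(dict(zip(header, fields)))
    return table


# Hypothetical sample content; real column names depend on the analysis run.
sample = [
    "  CHR   SNP       VALUE",
    "    1   rs123456  0.25",
    "    1   rs234567  0.50",
]
table = read_rectangular(sample)
```

For a real file, an open file handle can be passed in place of the sample list, since the function only iterates over lines.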

Running PLINK

PLINK is a command-line program: clicking on an icon will get you nowhere. Open up a command prompt or terminal window and perform all analyses by typing commands as described below. The basic form is:

plink --file mydata

Here PLINK expects two files, mydata.ped and mydata.map. Alternatively, the PED and MAP files can be specified separately if they have different names:

plink --ped mydata.ped --map autosomal.map

Note Loading a large file (100K+ SNPs) can take a while. PLINK will give an error message in most circumstances when something has gone wrong.

PED/MAP files

The PED file is a white-space (space or tab) delimited file. The first six columns are mandatory:

     Family ID
     Individual ID
     Paternal ID
     Maternal ID
     Sex (1=male; 2=female)
     Phenotype

The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a person. A PED file must have exactly one phenotype, in the sixth column. The phenotype can be either a quantitative trait or an affection status. Affection status is coded:

     0 missing
     1 unaffected
     2 affected

The missing phenotype value for quantitative traits is, by default, -9. It can be reset by including the --missing-phenotype option:

plink --file mydata --missing-phenotype 99

Genotypes (column 7 onwards) should also be white-space delimited; they can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else), but all markers should be biallelic. No header row should be given. For example, here are two individuals typed for 3 SNPs (one row = one person):

     FAM001  1  0 0  1  2  A A  G G  A C 
     FAM001  2  0 0  1  2  A A  G G  A C 
     ...
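The layout described above (six mandatory columns, then two allele columns per marker) can be checked with a short parser. This is a minimal sketch: the names PedRecord and parse_ped_line are hypothetical, and all values are kept as strings because alleles and IDs may be any characters.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]  # one (allele1, allele2) pair per marker


def parse_ped_line(line):
    """Split one PED line into the six mandatory columns plus allele pairs."""
    fields = line.split()  # split() handles both spaces and tabs
    if len(fields) < 6:
        raise ValueError("a PED line needs at least six columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in pairs (two alleles per marker)")
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], pairs)


# First example individual from the listing above.
record = parse_ped_line("FAM001  1  0 0  1  2  A A  G G  A C")
```

Here record.genotypes holds three allele pairs, one for each of the 3 SNPs in the example.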

Each line of the MAP file describes one marker and must contain exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced)
     rs# or snp identifier
     Genetic distance (morgans)
     Base-pair position

HINT To exclude a SNP from analysis, set the 4th column (physical base-pair position) to any negative value.

     1  rs123456  0  1234555
     1  rs234567  0  1237793
     1  rs224534  0  -1237697        <-- exclude this SNP
     1  rs233556  0  1337456
     ...

The MAP file must therefore contain as many markers as are in the PED file. The markers in the PED file do not need to be in genomic order, but the order of the markers in the MAP file must match the order of the genotype columns in the PED file. For basic association testing, the genetic distance column can be set to 0.
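The MAP rules above can be expressed as a small consistency check. The following is a sketch under the stated format only (4 columns per line, a negative position meaning exclusion, two allele columns per marker in the PED file); the function names are hypothetical and this is not how PLINK itself is implemented.

```python
def parse_map_line(line):
    """Return (chromosome, snp_id, genetic_distance, bp_position) for one MAP line."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"a MAP line must have exactly 4 columns, got {len(fields)}")
    chrom, snp_id, distance, position = fields
    return chrom, snp_id, float(distance), int(position)


def is_excluded(marker):
    """A negative base-pair position marks the SNP for exclusion."""
    return marker[3] < 0


def check_ped_against_map(ped_line, markers):
    """True if the PED line carries two allele columns for every MAP marker."""
    n_allele_columns = len(ped_line.split()) - 6  # skip the six mandatory columns
    return n_allele_columns == 2 * len(markers)


map_lines = [
    "1  rs123456  0  1234555",
    "1  rs234567  0  1237793",
    "1  rs224534  0  -1237697",
]
markers = [parse_map_line(line) for line in map_lines]
kept = [m[1] for m in markers if not is_excluded(m)]
ok = check_ped_against_map("FAM001  1  0 0  1  2  A A  G G  A C", markers)
```

Note that the excluded SNP still counts towards the number of markers when comparing against the PED file, since its genotype columns are still present there.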

Binary PED files

To save space and time, you can make a binary PED file (*.bed). This stores the pedigree/phenotype information in a separate file (*.fam) and creates an extended MAP file (*.bim), which contains information about the allele names that would otherwise be lost in the BED file. To create these files, use:

plink --file mydata --make-bed

which creates (by default)

     plink.bed      ( binary file, genotype information )
     plink.fam      ( first six columns of mydata.ped ) 
     plink.bim      ( extended MAP file: two extra cols = allele names)

The .fam and .bim files are still plain text files: these can be viewed with a standard text editor. Do not try to view the .bed file however: it is a compressed file and you'll only see lots of strange characters on the screen...

You can specify a different output root file name (i.e. different to "plink") by using the --output (or --out) option:

plink --file mydata --output mydata --make-bed

which will create

     mydata.bed
     mydata.fam
     mydata.bim

To subsequently load a binary fileset, just use --bfile instead of --file:

plink --bfile mydata
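Because --bfile relies on three files sharing one root name, it can be convenient to confirm that all of them are present before starting a long run. The helper below is a hypothetical convenience sketch, not part of PLINK; it only uses the .bed/.fam/.bim naming described above.

```python
import os


def binary_fileset(root):
    """File names that make up a binary fileset for a given root name."""
    return [f"{root}.bed", f"{root}.fam", f"{root}.bim"]


def missing_files(root):
    """Return the members of the fileset that are not present on disk."""
    return [path for path in binary_fileset(root) if not os.path.exists(path)]


names = binary_fileset("mydata")
```

An empty list from missing_files("mydata") would indicate that the fileset is complete.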

HINT For large files, first creating a binary PED file can save a lot of time and disk space!

When creating a binary ped file, the MAF and missingness filters are set to include everybody and all SNPs. If you want to change these, use --maf, --geno, etc, to manually specify these options: for example,

plink --file mydata --make-bed --maf 0.02 --geno 0.1

Alternate phenotype files

To specify an alternate phenotype for analysis, i.e. other than the one in the PED/BED file, use the --pheno option:

plink --file mydata --pheno pheno.txt

where pheno.txt is a file that contains 3 columns (one row per individual):

     Family ID
     Individual ID
     Phenotype

If an individual is in the original file but not listed in the alternate phenotype file, that person's phenotype will be set to missing. If a person is in the alternate phenotype file but not in the original file, that entry will be ignored. The order of the alternate phenotype file need not be the same as for the original file.
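The matching rules in the paragraph above can be sketched as a lookup keyed on family and individual ID. This is an illustration of the described behaviour only: the function name is hypothetical, and a missing phenotype is represented here as None, which is an assumption of the sketch rather than PLINK's internal encoding.

```python
def apply_alternate_phenotype(individuals, pheno_lines):
    """Return {(family_id, individual_id): phenotype} for the original individuals.

    individuals: list of (family_id, individual_id) pairs from the original file.
    pheno_lines: lines of the alternate phenotype file (family ID, individual ID, phenotype).
    """
    alternate = {}
    for line in pheno_lines:
        if not line.strip():
            continue
        fid, iid, value = line.split()[:3]
        alternate[(fid, iid)] = value
    # Individuals absent from the alternate file become missing (None here);
    # entries for people not in the original file are simply never looked up.
    return {person: alternate.get(person) for person in individuals}


# Hypothetical IDs; the alternate file is in a different order and has an extra person.
individuals = [("F1", "1"), ("F2", "1"), ("F3", "1")]
pheno_lines = ["F3 1 2", "F1 1 1", "F9 1 2"]
result = apply_alternate_phenotype(individuals, pheno_lines)
```

In this example F2/1 ends up with a missing phenotype and the F9/1 entry is ignored.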

If the phenotype file contains more than one phenotype, use the --mpheno N option to specify that the Nth phenotype is the one to be used:

plink --file mydata --pheno pheno2.txt --mpheno 4

Here pheno2.txt contains 5 different phenotypes (i.e. 7 columns in total), and this command will use the 4th (phenotype D) for analysis:

     Family ID
     Individual ID
     Phenotype A
     Phenotype B
     Phenotype C
     Phenotype D
     Phenotype E
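The column arithmetic behind --mpheno N is simple: the first two columns are IDs, so the Nth phenotype sits in column N+2. A minimal sketch follows, with a hypothetical function name and made-up phenotype values.

```python
def select_phenotype(line, n):
    """Pick the Nth phenotype (1-based) from one line of a multi-phenotype file."""
    fields = line.split()
    n_phenotypes = len(fields) - 2  # the first two columns are family and individual ID
    if not 1 <= n <= n_phenotypes:
        raise ValueError(f"N must be between 1 and {n_phenotypes}, got {n}")
    # Zero-based index n + 1 is the (n + 2)th column of the line.
    return fields[0], fields[1], fields[n + 1]


# Hypothetical line with 5 phenotypes (A to E), i.e. 7 columns in total.
selected = select_phenotype("F1 1 0.1 0.2 0.3 0.4 0.5", 4)
```

With N set to 4, the value returned is the one in the phenotype D position.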

Covariate files

TODO Covariates are not currently used in any analysis: when the GxE option is fully functional, this will be the way to specify the environmental term.

To load a covariate file, use the --covar option:

plink --file mydata --covar c.txt

In a similar manner to the alternate phenotype file, described above, an --mcovar N option can be used to select the Nth covariate if the file contains more than one.

The covariate file should be formatted in a similar manner to the phenotype file.

Warning! Missing values are not supported for covariates.

Cluster files

To load a cluster solution, which will be used in the permutation procedure, use the --within-file option:

plink --file mydata --within-file f.txt

See the section on association analysis to see how this feature can be used to control for potential confounders in analysis (by only permuting within groups).

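This file should have a similar structure to the alternate phenotype file: family ID, individual ID, then the cluster. The clusters should be numerically coded, for example:

     F1  I1  1
     F2  I1  1
     F3  I1  2
     F4  I1  2
     F5  I1  1
     F6  I1  3
     F7  I1  3
     ...

Here, individuals would be grouped: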

     Cluster 1: F1/I1  F2/I1   F5/I1
     Cluster 2: F3/I1  F4/I1
     Cluster 3: F6/I1  F7/I1
     ...
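The grouping shown above can be reproduced by reading a cluster file into a dictionary keyed on the numeric cluster code. This is a sketch only: the function name is hypothetical, and the sample lines are an assumed input (family ID, individual ID, cluster) chosen to be consistent with the grouping listed above.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Map each numeric cluster code to the list of 'family/individual' labels in it."""
    clusters = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        fid, iid, cluster = line.split()[:3]
        clusters[int(cluster)].append(f"{fid}/{iid}")
    return dict(clusters)


# Assumed cluster file content matching the grouping described in the text.
cluster_lines = [
    "F1 I1 1",
    "F2 I1 1",
    "F3 I1 2",
    "F4 I1 2",
    "F5 I1 1",
    "F6 I1 3",
    "F7 I1 3",
]
groups = group_by_cluster(cluster_lines)
```

Individuals in the same list share a cluster and would therefore be permuted only among themselves.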

TODO Specify behavior if an individual is not listed in the cluster file: specify numeric coding details.
