PLINK: Whole genome data analysis toolset
Basic usage / data formats
PLINK is a command-line program written in C/C++. Every
command is run by typing plink at the command prompt (e.g. a DOS
window or Unix terminal), followed by a number of options (each starting
with --) that specify the data files and methods
to be used.
All results are written to files with various extensions. The name of
the file is by default plink.ext where .ext will
change depending on the content of the file. Often these files will be
large: using a package such as R is suggested for visualising and
tabulating output. To make this easy, output files are in a standard
plain-text 'rectangular' format, with one header row and a fixed
number of columns per line.
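As an illustration of the 'rectangular' layout described above, here is a minimal Python sketch that reads such a file into a list of records. The column names and values in the example are hypothetical and are not real PLINK output; the only assumptions taken from the text are the single header row, white-space delimiting, and a fixed number of columns per line.

```python
def read_rectangular(text):
    """Parse white-space delimited text with one header row into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        # Every data line must have the same number of columns as the header.
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(fields)}")
        rows.append(dict(zip(header, fields)))
    return rows

# Hypothetical contents, for illustration only (not real PLINK output).
example = """\
 CHR   SNP    VALUE
   1  snpA     0.25
   1  snpB     0.50
"""
rows = read_rectangular(example)
print(rows[1]["SNP"], rows[1]["VALUE"])
```

All values are kept as strings here; converting numeric columns is left to the caller, since the set of columns depends on the output file type.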
The reference section gives a complete list of all options and
output file types.
Running PLINK
PLINK is a command-line program: clicking on an icon
will get you nowhere. Open up a command prompt or terminal window and
perform all analyses by typing commands as described below. For example:
plink --file mydata
where we expect two files: in this case, mydata.ped and
mydata.map; alternatively, the PED and MAP files can be
specified separately, if they have different names:
plink --ped mydata.ped --map autosomal.map
Note Loading a large file (100K+ SNPs) can take a
while. PLINK will give an error message in most circumstances
when something has gone wrong.
PED/MAP files
The PED file is a white-space (space or tab) delimited file: the first six
columns are mandatory:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female)
Phenotype
The IDs are alphanumeric: the combination of family and individual ID
should uniquely identify a person. A PED file must have exactly one
phenotype, in the sixth column. The phenotype can be either a
quantitative trait or an affection status column. Affection status is
coded:
0 missing
1 unaffected
2 affected
The missing phenotype value for quantitative traits is, by default, -9. It can be reset by including the --missing-phenotype option:
plink --file mydata --missing-phenotype 99
Genotypes (column 7 onwards) should also be white-space delimited; they
can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else), but all
markers should be biallelic. No header row should be given. For example,
here are two individuals typed for 3 SNPs (one row = one person):
FAM001 1 0 0 1 2 A A G G A C
FAM001 2 0 0 1 2 A A G G A C
...
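The PED layout above can be sketched in Python as follows. This is only an illustrative reader, not PLINK's own parser: the PedRecord type and parse_ped_line function are hypothetical names, and the sketch assumes that each genotype occupies two consecutive allele columns, as in the example rows.

```python
from typing import List, NamedTuple, Tuple

class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str          # 1=male; 2=female
    phenotype: str    # kept as text: may be quantitative or affection status
    genotypes: List[Tuple[str, str]]

def parse_ped_line(line):
    """Split one PED row into the six mandatory columns plus allele pairs."""
    fields = line.split()  # handles both spaces and tabs
    if len(fields) < 6:
        raise ValueError("a PED row needs at least six columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("odd number of allele columns")
    # Pair up consecutive allele columns: (col 7, col 8), (col 9, col 10), ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)

rec = parse_ped_line("FAM001 1 0 0 1 2 A A G G A C")
print(rec.family_id, rec.individual_id, rec.genotypes)
```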
Each line of the MAP file describes one marker and must contain exactly 4
columns:
chromosome (1-22, X, Y or 0 if unplaced)
rs# or snp identifier
Genetic distance (morgans)
Base-pair position
HINT To exclude a SNP from analysis, set the 4th column
(physical base-pair position) to any negative value.
The MAP file must therefore contain as many markers as are in the PED
file. The markers in the PED file do not need to be in genomic order,
but the order of markers in the MAP file must match the order of markers
in the PED file. For basic association testing, the genetic distance column
can be set to 0.
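The MAP rules above (exactly 4 columns, a negative base-pair position marks an excluded SNP, and the marker count must match the PED file) can be checked with a short Python sketch. The function names and the marker identifiers snpA, snpB and snpC are hypothetical; this is not how PLINK itself validates its input.

```python
def parse_map(text):
    """Parse MAP text: one marker per line, exactly 4 white-space delimited columns."""
    markers = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        chrom, snp_id, morgans, bp = fields
        markers.append({
            "chrom": chrom,
            "snp": snp_id,
            "morgans": float(morgans),
            "bp": int(bp),
            "excluded": int(bp) < 0,  # negative position = excluded from analysis
        })
    return markers

def marker_counts_agree(markers, ped_line):
    """True if a PED row carries two allele columns for every MAP marker."""
    n_allele_columns = len(ped_line.split()) - 6
    return n_allele_columns == 2 * len(markers)

# Hypothetical MAP contents; the third marker is excluded but still listed.
map_text = """\
1 snpA 0 1000
1 snpB 0 2000
1 snpC 0 -3000
"""
markers = parse_map(map_text)
print([m["snp"] for m in markers if not m["excluded"]])
print(marker_counts_agree(markers, "FAM001 1 0 0 1 2 A A G G A C"))
```

Note that the excluded marker still counts towards the total, which is why the MAP file must list as many markers as the PED file.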
Binary PED files
To save space and time, you can make a binary PED file (*.bed). This
will store the pedigree/phenotype information in a separate file (*.fam)
and create an extended MAP file (*.bim), which contains information
about the allele names that would otherwise be lost in the BED
file. To create these files, use:
plink --file mydata --make-bed
which creates (by default)
plink.bed ( binary file, genotype information )
plink.fam ( first six columns of mydata.ped )
plink.bim ( extended MAP file: two extra cols = allele names)
The .fam and .bim files are still plain text files:
these can be viewed with a standard text editor. Do not try to view
the .bed file however: it is a compressed file and you'll
only see lots of strange characters on the screen...
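Because the .fam and .bim files are plain text, they can be read in the same way as PED and MAP files. The Python sketch below is an assumption-laden illustration: it takes the .fam row to be the six mandatory PED columns and the .bim row to be the four MAP columns followed by the two allele names. The function names and the marker identifier snpA are hypothetical.

```python
def parse_fam_line(line):
    """One .fam row: the first six PED columns, no genotypes."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    keys = ("family_id", "individual_id", "paternal_id",
            "maternal_id", "sex", "phenotype")
    return dict(zip(keys, fields))

def parse_bim_line(line):
    """One .bim row: the 4 MAP columns plus 2 allele-name columns (assumed order)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    chrom, snp_id, morgans, bp, allele1, allele2 = fields
    return {"chrom": chrom, "snp": snp_id, "morgans": float(morgans),
            "bp": int(bp), "alleles": (allele1, allele2)}

fam = parse_fam_line("FAM001 1 0 0 1 2")
bim = parse_bim_line("1 snpA 0 1000 A C")  # hypothetical marker
print(fam["individual_id"], bim["snp"], bim["alleles"])
```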
You can specify a different output root file name (i.e. different to "plink")
by using the --output (or --out) option:
plink --file mydata --output mydata --make-bed
which will create
mydata.bed
mydata.fam
mydata.bim
To subsequently load a binary file, just use --bfile instead
of --file:
plink --bfile mydata
HINT For large files, first creating a binary PED
file can save a lot of time and disk space!
When creating a binary ped file, the MAF and missingness filters are
set to include everybody and all SNPs. If you want to change these,
use --maf, --geno, etc, to manually specify these
options: for example,
plink --file mydata --make-bed --maf 0.02 --geno 0.1
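To make the two quantities behind these filters concrete, here is a rough Python sketch that computes the minor allele frequency and the genotype missingness rate for a single SNP. It is not PLINK's implementation. It assumes that a missing genotype is coded with the allele "0" and that a SNP passes when its MAF is at least the --maf value and its missingness is at most the --geno value; both are assumptions that the text above does not spell out.

```python
from collections import Counter

def snp_summary(genotypes, missing_allele="0"):
    """Return (maf, missing_rate) for one SNP.

    genotypes: non-empty list of (allele1, allele2) pairs, one per individual.
    missing_allele: assumed code for a missing allele.
    """
    called = [g for g in genotypes if missing_allele not in g]
    missing_rate = 1 - len(called) / len(genotypes)
    counts = Counter(allele for g in called for allele in g)
    total = sum(counts.values())
    # A monomorphic (or entirely missing) SNP has no minor allele: MAF is 0.
    maf = min(counts.values()) / total if len(counts) == 2 else 0.0
    return maf, missing_rate

def passes_filters(genotypes, min_maf=0.02, max_missing=0.1):
    """Assumed inclusive thresholds, mirroring --maf 0.02 --geno 0.1."""
    maf, missing_rate = snp_summary(genotypes)
    return maf >= min_maf and missing_rate <= max_missing

snp = [("A", "A"), ("A", "C"), ("A", "A"), ("0", "0")]
maf, missing_rate = snp_summary(snp)
print(round(maf, 4), missing_rate, passes_filters(snp))
```

In this example one of four genotypes is missing, so the SNP fails the missingness threshold even though its minor allele frequency is well above 0.02.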
Alternate phenotype files
To specify an alternate phenotype for analysis, i.e. other than the one in the PED/BED file, use the --pheno option:
plink --file mydata --pheno pheno.txt
where pheno.txt is a file that contains 3 columns (one row per
individual):
Family ID
Individual ID
Phenotype
If an individual is in the original file but not listed in the
alternate phenotype file, that person's phenotype will be set to
missing. If a person is in the alternate phenotype file but not in the
original file, that entry will be ignored. The order of the alternate
phenotype file need not be the same as for the original file.
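The matching rules in the paragraph above can be sketched in Python as a simple lookup keyed on family and individual ID. The function name is hypothetical, and the missing code of -9 used here is an assumption based on the default for quantitative traits; this only illustrates the described behaviour and is not PLINK's code.

```python
def apply_alternate_phenotype(individuals, pheno_rows, missing="-9"):
    """Return the phenotype for each individual in the original file.

    individuals: list of (family_id, individual_id) from the PED/BED file.
    pheno_rows:  list of (family_id, individual_id, phenotype) from the
                 alternate phenotype file, in any order.
    missing:     assumed missing-phenotype code.
    """
    lookup = {(fid, iid): pheno for fid, iid, pheno in pheno_rows}
    # Individuals absent from the phenotype file become missing; entries for
    # people who are not in the original file are never looked up, so they
    # are ignored.
    return [lookup.get(key, missing) for key in individuals]

individuals = [("FAM001", "1"), ("FAM001", "2"), ("FAM002", "1")]
pheno_rows = [
    ("FAM002", "1", "2"),   # order differs from the original file
    ("FAM001", "1", "1"),
    ("FAM999", "1", "2"),   # not in the original file: ignored
]
result = apply_alternate_phenotype(individuals, pheno_rows)
print(result)
```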
If the phenotype file contains more than one phenotype, then use the --mpheno N option to specify that the Nth phenotype is the one to be used:
plink --file mydata --pheno pheno2.txt --mpheno 4
where pheno2.txt contains 5 different phenotypes (i.e. 7
columns in total); this command will use the 4th for analysis (phenotype D):
Family ID
Individual ID
Phenotype A
Phenotype B
Phenotype C
Phenotype D
Phenotype E
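The column arithmetic behind --mpheno can be sketched in Python: because the first two columns are the family and individual IDs, the Nth phenotype sits in column N+2 of the file. The function name and the phenotype values below are hypothetical.

```python
def select_phenotype(line, n):
    """Pick the Nth phenotype (1-based) from one row of a multi-phenotype file."""
    fields = line.split()
    n_phenotypes = len(fields) - 2  # first two columns are the IDs
    if not 1 <= n <= n_phenotypes:
        raise ValueError(f"N must be between 1 and {n_phenotypes}")
    # Column N+2 in 1-based terms is index N+1 in 0-based terms.
    return fields[0], fields[1], fields[n + 1]

# Hypothetical row with 5 phenotypes (A to E), i.e. 7 columns in total.
row = "FAM001 1 0.1 0.2 0.3 0.4 0.5"
selected = select_phenotype(row, 4)
print(selected)
```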
Covariate files
TODO Covariates are not currently used in any analysis:
when the GxE option is fully functional, this will be the way to
specify the environmental term.
To load a covariate file, use the --covar option:
plink --file mydata --covar c.txt
In a similar manner to the alternate phenotype file, described above,
an --mcovar N option can be used to select the Nth
covariate if the file contains more than one.
The covariate file should be formatted in a similar manner to the
phenotype file.
Warning! Missing values are not supported for covariates.
Cluster files
To load a cluster solution, which will be used in the permutation
procedure, use the --within-file option:
plink --file mydata --within-file f.txt
See the section on association analysis to see how this feature can be
used to control for potential confounders in analysis (by only
permuting within groups).
The cluster file should have a similar structure to the alternate phenotype
file. The clusters should be numerically coded.
TODO Specify behavior if an individual is not listed in
the cluster file; specify numeric coding details.
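The idea of permuting only within groups can be illustrated with a small Python sketch. This is a generic within-cluster label shuffle written for illustration, not PLINK's permutation procedure; the function name and the example cluster codes are hypothetical, and it assumes every individual has a cluster assignment.

```python
import random
from collections import defaultdict

def permute_within_clusters(phenotypes, clusters, rng=None):
    """Shuffle phenotype labels among individuals that share a cluster.

    phenotypes: list of phenotype values, one per individual.
    clusters:   list of cluster codes, aligned with phenotypes.
    rng:        optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(clusters):
        by_cluster[cluster].append(index)
    permuted = list(phenotypes)
    for indices in by_cluster.values():
        values = [phenotypes[i] for i in indices]
        rng.shuffle(values)  # labels never leave their own cluster
        for i, value in zip(indices, values):
            permuted[i] = value
    return permuted

phenotypes = [1, 2, 1, 2, 2, 1]
clusters = [1, 1, 1, 2, 2, 2]
shuffled = permute_within_clusters(phenotypes, clusters, random.Random(0))
print(shuffled)
```

Each cluster keeps the same number of affected and unaffected individuals after shuffling, which is what allows the permutation to control for differences between clusters.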