PLINK/SEQ 101
In this introduction to PLINK/SEQ, we will use one toy VCF dataset and PSEQ, pbrowse and the R interface to view, filter and summarise these data. First ensure that PSEQ is properly installed on your system, by typing the following at the command prompt:
pseq help
This should produce a list of the help topics that are available:
usage:pseq {project-file|VCF|-|.} {command} {--options} Command groups --------------------------------------------------------- input Data input output Variant data output project Project functions ...
A toy VCF file
Here we consider a very simple toy project, that consists of data on 5 individuals and 3 variants from a single VCF called ex1.vcf:
As described more fully here a VCF is a text file (that can be gzip-compressed) that contains:
- rows of meta-information (starting ##) that in real datasets often give information about the data, how they were produced, and what the various tags or attributes that appear with the variants and genotypes mean. Common tags may indicate read-depth, or presence in dbSNP, for example; here, we have only dummy tags labelled VM, SM and GM (that do not convey any useful information, other than to illustrate how these fields are handled by PLINK/Seq).
- a single header row (starting #CHROM) that always contains the first eight fields that describe the variant at a given position; typically, a VCF will also contain genotype data for specific individuals too, in which case their unique identifiers (IDs) are listed (P001, P002, etc).
- all subsequent rows contain the actual data for the variant and
genotype calls on each individual. For example, the first row
indicates a T>C SNP at position 1001 on chromosome 1; in this example,
individual P003 is heterozygous (numerically
encoded 0/1 where 0 if the reference allele, 1 is the first
alternate allele) for the variant, with an associated meta-attribute
value of 3 for GM, as dictated by the FORMAT field
GT:GM, where GT is the standard keyword for
genotype.
PSEQ command-line tool
The PSEQ command line tool is the primary interface of the PLINK/Seq library. PSEQ commands take the basic form:
pseq input-source command {--arguments}
The input-source element specifies where the data come from and usually points to a VCF file or a PLINK/Seq project, as illustrated below.
Each PSEQ command takes a single command, which specifies the type of operation to be performed. For example, the command v-view simply lists some basic pieces of information about each variant (the fields in the output are described here):
pseq ex1.vcf v-view
chr1:1001 rs1001 T/C . 1 PASS chr1:1002 rs1002 G/A . 1 PASS chr1:1003 rs1003 G/A . 1 PASS
Often, the basic behaviour of a command can be modified by one or more arguments. For example, to view the full set of genotype and meta-information for the variants, one would add the following:
pseq ex1.vcf v-view --vmeta --gmeta
chr1:1001 rs1001 T/C . 1 PASS . VM=1;SM=100 P001 1 C/C [GM=1] P002 1 T/T [GM=2] P003 1 T/C [GM=3] P004 1 C/C [GM=4] P005 1 C/C [GM=5] chr1:1002 rs1002 G/A . 1 PASS . VM=2;SM=101 P001 1 A/A [GM=6] P002 1 G/A [GM=7] P003 1 G/A [GM=8] P004 1 A/A [GM=9] P005 1 A/A [GM=10] chr1:1003 rs1003 G/A . 1 PASS . VM=3;SM=102 P001 1 G/G [GM=11] P002 1 G/G [GM=12] P003 1 G/A [GM=13] P004 1 A/G [GM=14] P005 1 G/G [GM=15]
Another very common type of optional argument is a mask. These are effectively filters that operate on variants, genotypes, individuals or files, either to include, exclude, group or modify the data in various ways. For example, this mask pulls out only a single variant (at position 1002) for three of the five individuals, and sets genotypes to missing (./.) if the GM tag is above a certain value (single line split for viewing):
pseq ex1.vcf v-view --gmeta
--mask reg=chr1:1002 include="gf(GM>8)" indiv=P001,P003,P005chr1:1002 rs1002 G/A . 1 PASS P001 1 ./. [GM=6] P003 1 ./. [GM=8] P005 1 A/A [GM=10]
Other commands perform different types of actions: in this example, the counts provides the number of minor alleles per variant:
pseq ex1.vcf counts
VAR REF/ALT MINOR CNT TOT chr1:1001 T/C T 3 10 chr1:1002 G/A G 2 10 chr1:1003 G/A A 2 10
For many purposes it is useful to work with a full project rather than a single VCF file. We first create a new PLINK/SEQ project called proj1:
pseq proj1 new-project
and then load the VCF into this project:
pseq proj1 load-vcf --vcf ex1.vcf
To check that the data have loaded as expected:
pseq proj1 var-summary
---Variant DB summary--- 3 unique variants File tag : 1 (3 variants, 5 individuals)
Now, the same commands performed above can applied to the project instead of the VCF: i.e.
pseq proj1 counts
should yield the same results as above.
pbrowse interactive viewer
Having created a project, it is now possible to use other PLINK/Seq tools with these data. If pbrowse is installed on your system (it should be following a standard installation) and assuming you are working in a graphical environment that can open a web-browser, the following command should open up this project in the browser:
pbrowse proj1
Clicking on Fetch will list all variants in the project. Clicking view on a specific variant will list the individual genotypes for that variant. You can also see these same views via the copy of pbrowse installed on this web-server for this project: for example, the variant list and individual genotypes.
Rplinkseq
Finally, if R is installed on your machine, and you've installed the Rplinkseq R library version of PLINK/Seq, then you should be able to start R, load the Rplinkseq library and attach this project so that it will be visible: (typing at the R command prompt)
library(Rplinkseq)
PLINK/Seq genetics library for R | v0.08 | 15-Mar-2012
pseq.project( "proj1" )
Now you can bring parts of the project data into R (for example, here, the entire 3 by 5 matrix of alternate allele counts for variants by individuals)
k <- var.fetch()
x.consensus.genotype( k )
[,1] [,2] [,3] [,4] [,5] [1,] 2 0 1 2 2 [2,] 2 1 1 2 2 [3,] 0 0 1 1 0
or apply your own functions across the dataset (here, simply listing the position and reference/alternate alleles for each variant):
f <- function(v) { cat(v$CHR,":",v$BP1,v$CON$REF,v$CON$ALT,"\n") }
var.iterate( f )
1 : 1001 T C 1 : 1002 G A 1 : 1003 G A
Please read the rest of this website for more information on these tools.