PLINK/SEQ 101

In this introduction to PLINK/SEQ, we will use one toy VCF dataset and PSEQ, pbrowse and the R interface to view, filter and summarise these data. First ensure that PSEQ is properly installed on your system, by typing the following at the command prompt:

pseq help

This should produce a list of the help topics that are available:

  usage:pseq {project-file|VCF|-|.} {command} {--options}

  Command groups
  ---------------------------------------------------------
  input      Data input
  output     Variant data output
  project    Project functions
  ...

A toy VCF file

Here we consider a very simple toy project, that consists of data on 5 individuals and 3 variants from a single VCF called ex1.vcf:

##fileformat=VCFv4.1 ##FORMAT=<ID=GM,Number=1,Type=Integer,Description="Genotype meta"> ##INFO=<ID=VM,Number=1,Type=Integer,Description="Variant meta"> ##INFO=<ID=SM,Number=1,Type=Integer,Description="SampleVariant meta"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P001 P002 P003 P004 P005 1 1001 rs1001 T C 999 . VM=1;SM=100 GT:GM 1/1:1 0/0:2 0/1:3 1/1:4 1/1:5 1 1002 rs1002 G A 999 . VM=2;SM=101 GT:GM 1/1:6 0/1:7 0/1:8 1/1:9 1/1:10 1 1003 rs1003 G A 999 . VM=3;SM=102 GT:GM 0/0:11 0/0:12 0/1:13 1/0:14 0/0:15

As described more fully here a VCF is a text file (that can be gzip-compressed) that contains:

  • rows of meta-information (starting ##) that in real datasets often give information about the data, how they were produced, and what the various tags or attributes that appear with the variants and genotypes mean. Common tags may indicate read-depth, or presence in dbSNP, for example; here, we have only dummy tags labelled VM, SM and GM (that do not convey any useful information, other than to illustrate how these fields are handled by PLINK/Seq).

  • a single header row (starting #CHROM) that always contains the first eight fields that describe the variant at a given position; typically, a VCF will also contain genotype data for specific individuals too, in which case their unique identifiers (IDs) are listed (P001, P002, etc).

  • all subsequent rows contain the actual data for the variant and genotype calls on each individual. For example, the first row indicates a T>C SNP at position 1001 on chromosome 1; in this example, individual P003 is heterozygous (numerically encoded 0/1 where 0 if the reference allele, 1 is the first alternate allele) for the variant, with an associated meta-attribute value of 3 for GM, as dictated by the FORMAT field GT:GM, where GT is the standard keyword for genotype.

    PSEQ command-line tool

    The PSEQ command line tool is the primary interface of the PLINK/Seq library. PSEQ commands take the basic form:

    pseq input-source command {--arguments}

    The input-source element specifies where the data come from and usually points to a VCF file or a PLINK/Seq project, as illustrated below.

    Each PSEQ command takes a single command, which specifies the type of operation to be performed. For example, the command v-view simply lists some basic pieces of information about each variant (the fields in the output are described here):

    pseq ex1.vcf v-view
      chr1:1001  rs1001  T/C  .  1  PASS
      chr1:1002  rs1002  G/A  .  1  PASS
      chr1:1003  rs1003  G/A  .  1  PASS
    

    Often, the basic behaviour of a command can be modified by one or more arguments. For example, to view the full set of genotype and meta-information for the variants, one would add the following:

    pseq ex1.vcf v-view --vmeta --gmeta
      chr1:1001  rs1001    T/C   .   1   PASS   .   VM=1;SM=100
      P001  1  C/C [GM=1]
      P002  1  T/T [GM=2]
      P003  1  T/C [GM=3]
      P004  1  C/C [GM=4]
      P005  1  C/C [GM=5]
      chr1:1002  rs1002    G/A   .   1   PASS   .   VM=2;SM=101
      P001  1  A/A [GM=6]
      P002  1  G/A [GM=7]
      P003  1  G/A [GM=8]
      P004  1  A/A [GM=9]
      P005  1  A/A [GM=10]
      chr1:1003  rs1003    G/A   .   1   PASS   .   VM=3;SM=102
      P001  1  G/G [GM=11]
      P002  1  G/G [GM=12]
      P003  1  G/A [GM=13]
      P004  1  A/G [GM=14]
      P005  1  G/G [GM=15]
    

    Another very common type of optional argument is a mask. These are effectively filters that operate on variants, genotypes, individuals or files, either to include, exclude, group or modify the data in various ways. For example, this mask pulls out only a single variant (at position 1002) for three of the five individuals, and sets genotypes to missing (./.) if the GM tag is above a certain value (single line split for viewing):

    pseq ex1.vcf v-view --gmeta
          --mask reg=chr1:1002 include="gf(GM>8)" indiv=P001,P003,P005
      chr1:1002   rs1002   G/A   .   1   PASS
      P001  1  ./. [GM=6]
      P003  1  ./. [GM=8]
      P005  1  A/A [GM=10]
    

    Other commands perform different types of actions: in this example, the counts provides the number of minor alleles per variant:

    pseq ex1.vcf counts
      VAR        REF/ALT  MINOR  CNT  TOT
      chr1:1001  T/C      T      3    10
      chr1:1002  G/A      G      2    10
      chr1:1003  G/A      A      2    10
    

    For many purposes it is useful to work with a full project rather than a single VCF file. We first create a new PLINK/SEQ project called proj1:

    pseq proj1 new-project

    and then load the VCF into this project:

    pseq proj1 load-vcf --vcf ex1.vcf

    To check that the data have loaded as expected:

    pseq proj1 var-summary
    ---Variant DB summary---
    3 unique variants
    File tag : 1 (3 variants, 5 individuals)
    

    Now, the same commands performed above can applied to the project instead of the VCF: i.e.

    pseq proj1 counts

    should yield the same results as above.

    pbrowse interactive viewer

    Having created a project, it is now possible to use other PLINK/Seq tools with these data. If pbrowse is installed on your system (it should be following a standard installation) and assuming you are working in a graphical environment that can open a web-browser, the following command should open up this project in the browser:

    pbrowse proj1

    Clicking on Fetch will list all variants in the project. Clicking view on a specific variant will list the individual genotypes for that variant. You can also see these same views via the copy of pbrowse installed on this web-server for this project: for example, the variant list and individual genotypes.

    Rplinkseq

    Finally, if R is installed on your machine, and you've installed the Rplinkseq R library version of PLINK/Seq, then you should be able to start R, load the Rplinkseq library and attach this project so that it will be visible: (typing at the R command prompt)

    library(Rplinkseq)
    PLINK/Seq genetics library for R | v0.08 | 15-Mar-2012
    
    pseq.project( "proj1" )

    Now you can bring parts of the project data into R (for example, here, the entire 3 by 5 matrix of alternate allele counts for variants by individuals)

    k <- var.fetch() x.consensus.genotype( k )
         [,1] [,2] [,3] [,4] [,5]
    [1,]    2    0    1    2    2
    [2,]    2    1    1    2    2
    [3,]    0    0    1    1    0
    

    or apply your own functions across the dataset (here, simply listing the position and reference/alternate alleles for each variant):

    f <- function(v) { cat(v$CHR,":",v$BP1,v$CON$REF,v$CON$ALT,"\n") } var.iterate( f )
    1 : 1001 T C 
    1 : 1002 G A 
    1 : 1003 G A 
    

     


    Please read the rest of this website for more information on these tools.