PLINK/SEQ R Package

Rplinkseq is an R package that allows access to PLINK/Seq projects directly from R, so that R's rich set of statistical and visualisation tools can be utilised. Rplinkseq is implemented as an extension library, which enables access to the PLINK/Seq. This means that all the core features of the PLINK/Seq library (i.e. working with large VCFs, applying various masks and filters, etc) are available via R. However, many of the functions implemented within PSEQ (itself an application built on top of the core library) are not available. For example, the genotype-phenotype association tests, functions for summarising datasets, functions to create new resource databases, etc, implemented in PSEQ are not available in Rplinkseq. Thus the command-line PSEQ will remain the central tool for most purposes. The value of the R library is that it potentially provides a better platform for developing novel applications, particularly those involving more sophisticated statistical models or that involve visualisation. This page contains the following sections:

The Rplinkseq library can be downloaded from the main downloads page. Please see this page for notes on how to install this package and its dependencies.

R functions for working with PLINK/Seq projects

Assuming Rplinkseq has been correctly downloaded and installed, you can attach the Rplinkseq library like most R packages:

library(Rplinkseq)

which should produce something like the following:

  PLINK/SEQ genetics library | v0.08 | 16-Mar-2012

You can now attach a single PLINK/Seq project, or use some of the convenience functions for working with VCFs directly. This table lists some of the core functions:

Name Arguments Description
pseq.project() project-name Attach a PLINK/Seq project
var.fetch() mask-string Returns a list object of variants
load.vcf() filename-name Directly load a VCF file into a variant list
ind.fetch() phenotype(s), mask-string Returns a list of individual IDs and phenotypes
meta.fetch() variant tag(s), mask-string Returns a dataframe of variant meta-information
var.iterate() R function, mask-string Apply function to variant data, given a mask

The fundamental unit of Rplinkseq is the object that represents a single variant, as described below. Variant and genotype information is converted from a PLINK/Seq project (or, in some cases, a simple single VCF file) into one or variant lists. Variant lists can either be explicitly created and saved in memory (via the var.fetch() command), or can instead be created only temporarily, for the purposes of applying a specific function (using var.iterate()). This latter approach avoids having to store very large datasets in memory. The library provides a number of convenience functions for working with variant lists in R, described below.

Loading a PLINK/Seq project

In this page we consider the example dataset described in this tutorial that creates a project consisting of pilot 3 data from the 1000 Genomes project. The tutorial details the steps to make a project, named proj. It is attached to R as follows (assuming we are in the same directory as proj, otherwise the full path must be specified):

pseq.project( "proj" )

The pseq.summary() function will return the same information as the PSEQ summary command (as a character vector of output lines):

pseq.summary()
 [1] "---File-index summary---"                                                              
 [2] "Core project specification index : proj"                                               
 [3] "Core OUTPUT file : /Users/shaun/src/pseq/proj_out/"                                    
 [4] "Core RESOURCES file : /Users/shaun/data/hg18/"                                         
 [5] "Core LOCDB file : /Users/shaun/data/hg18/locdb"                                        
 [6] "Core INDDB file : /Users/shaun/src/pseq/proj_out/inddb"                                
 [7] "Core VARDB file : /Users/shaun/src/pseq/proj_out/vardb"                                
 [8] "Core LOG file : /Users/shaun/src/pseq/proj_out/log.txt"                                
 [9] "Core SEQDB file : /Users/shaun/data/hg18/seqdb"                                        
[10] "Core REFDB file : /Users/shaun/data/hg18/refdb"                                        
[11] "Added VCF : /Users/shaun/scratch/tmp-today/dist/vcf/CEU.exon.2010_09.genotypes.vcf.gz" 
[12] "Added VCF : /Users/shaun/scratch/tmp-today/dist/vcf/TSI.exon.2010_09.genotypes.vcf.gz" 
[13] "---Variant DB summary---"                                                              
[14] "4449 unique variants"                                                                  
[15] "File tag : 1 (3489 variants, 90 individuals)"                                          
[16] "File tag : 2 (3281 variants, 66 individuals)"                                          
[17] "---Individual DB summary---"                                                           
[18] "0 unique individuals"                                                                  
[19] "---Locus DB summary---"                                                                
[20] "Group : refseq (32739 entries) n/a"                                                    
...

It is also possible to load a VCF file directly into R using the load.vcf() function. In this case, it is not necesary that a PLINK/Seq project has been built previously. This will typically only be appropriate for small or moderate-sized VCF files:

d <- load.vcf( "/path/to/CEU.exon.2010_09.genotypes.vcf.gz" )

The object d is now a 3-element list containing reprentations of 1) the meta-information header fields from the VCF, 2) the header row containing the individual IDs, 3) a variant list, containing all 3489 variants from the CEU file:

str(d,max.level=1)
List of 3
 $ META: chr [1:16] "##fileformat=VCFv4.0" "##INFO=<ID=DP,Number=1,Type=Integer,Descr....
 $ ID  : chr [1:90] "NA06984" "NA06985" "NA06986" "NA06989" ...
 $ VAR :List of 3489

The variant-list in the third element of d should be identical to that which would be obtained by the command:

var.fetch( "file=CEU" , limit=10000 )

in the context of the project described above (that contained this particular VCF).

In addition, masks can be specified with the load.vcf(), such that only partial data are loaded (note, this example uses the x.core convenience function, described below):

d <- load.vcf( "CEU.exon...vcf.gz" , mask = "reg=chr6:25000000..35000000" ) head(cbind( x.core( d[[3]] , "CHR" ) , x.core( d[[3]] , "BP1" ) ))
       [,1]     [,2]
  [1,]    6 25957599
  [2,]    6 25958824
  [3,]    6 25970445
  [4,]    6 26232016
  [5,]    6 26232084
  [6,]    6 26653611

Variants-lists: representation of variant data in R

There are two ways to directly get at a project's variant data. var.fetch() returns a list of variants as an R list, whereas var.iterate() applies a function to a set of variants iteratively. var.fetch() is appropriate for obtaining relatively small sets of variants (that fit into memory). Depending on the sample and the computer, tens or even hundreds of thousands of variants are likely to be okay, but not millions (certainly not for large samples). var.iterate(), described below, is designed to handle larger variant sets. In addition, load.vcf() can be used to create variant-lists directly from a VCF, as described in the section above.

We'll begin with var.fetch(), which has two (optional) arguments:

  • mask is a character string, which is interpreted as a PLINK/SEQ mask.
  • limit sets the maximum number of variants to be returned, which is 1000 by default. The purpose of this limit is to avoid very large datasets being drawn into memory accidentally (e.g. by mis-specifying or forgetting the mask, for large datasets).

For example, in this particular dataset:

k <- var.fetch( "reg=chr22:17000000..19000000" )

returns all 11 variants on chromosome 22 between 17Mb and 19Mb as a single variant-list:

length(k)
[1] 11

Variation data are represented as list objects in R (i.e. k in this example, that actually contains a series of nested lists), with the following structure:

 CHR    integer    chromosome code
 BP1    integer    physical position 
 BP2    integer    second physical position (stop)
 ID     text       variant ID
 NS     integer    number of samples / VCFs
 META   list       variant meta-information
 CON    list       consensus sample-variant 
 S1     list       first sample-variant
 S2     list       second sample-variant
 ...

Here sample refers to a collection of individuals, not a single individual. The structure of the sample-variant list objects is as follows:

 FSET   integer    fileset identifier (0 for consensus)
 REF    text       reference allele
 ALT    text       alternate allele(s)
 QUAL   float      quality score (VCF field)
 FILTER text       filter string (VCF field)
 META   list       list of meta-information fields
 GENO   list       list of genotype data

R's str() function provides a convenient way to look at the structure of this variant-list (that is actually a series of nested lists):

str( k[[1]], max.level=2 )
List of 8
 $ CHR : int 22
 $ BP1 : int 17729354
 $ BP2 : int 17729354
 $ ID  : chr "rs116154317"
 $ NS  : int 1
 $ META:List of 3
  ..$ AA: chr "G"
  ..$ GP: chr "22:19349354"
  ..$ BN: int 132
 $ CON :List of 7
  ..$ FSET  : int 0
  ..$ REF   : chr "G"
  ..$ ALT   : chr "A"
  ..$ QUAL  : num -1
  ..$ FILTER: chr(0) 
  ..$ META  : Named list()
  ..$ GENO  :List of 2
 $ S1  :List of 7
  ..$ FSET  : int 1
  ..$ REF   : chr "G"
  ..$ ALT   : chr "A"
  ..$ QUAL  : num -1
  ..$ FILTER: chr ""
  ..$ META  :List of 6
  ..$ GENO  :List of 2

The format of the META lists will reflect the meta-information associated with that variant. For example:

> str(k[[1]]$S1$META)
List of 6
 $ AA: chr "G"
 $ GP: chr "22:19349354"
 $ AC: int 1
 $ AN: int 180
 $ DP: int 11831
 $ BN: int 132

The format of the GENO lists also represents any per-individual meta-fields, although the first field will always be GT for genotype. For example,

 $ GENO  :List of 2
  ..$ GT: int [1:156] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ DP: int [1:156] 24 13 291 20 4 28 26 80 238 194 ...

This indicates that 156 individuals have data, for a genotype and also for a field called DP (i.e. per individual read-depth). Currently, genotype data are downcoded to this simple non-reference allele count (0,1,2 or NA) in the GT list. Future releases will incorporate ways to access the other genotypic information that can be represented in VCFs and stored within VCFs and PLINK/Seq projects (e.g. haplotype phase, ploidy, multi-allelic variants).

The convenience functions described below provide a few ways to facilitate working with the variant data in list form -- i.e. extracting analysis-ready vectors and matrices of genotypes, phenotypes and meta-fields.

Fetching meta-information from variants and individuals

To obtain individual-level information (IDs, phenotypes, covaraites), use the ind.fetch function. For example, to return a list that contains a vector of individual IDs and a vector representing a phenotype labelled phe1:

> str ( ind.fetch( "phe1" ) )
List of 2
 $ ID  : chr [1:156] "NA06984" "NA06985" "NA06986" "NA06989" ...
 $ phe1: int [1:156] 2 2 2 2 2 2 2 2 2 2 ...

This function optionally accepts masks to be specified also. The full form takes a vector of phenotypes to fetch and a single string that defines a mask: e.g. to obtain only "cases" for phe1 and also two other covariates, if they existed, cov1 and cov2:

p <- ind.fetch( pheno = c( "phe1" , "cov1" , "cov2" ) , mask = "phe=phe1:2" )

To directly obtain variant-level meta-information from a project without having the overhead of generating whole variant/genotype list objects with var.fetch and then extracting the information (e.g. with a convenience function such as x.consensus()), use meta.fetch. This returns a data-frame object:

d <- meta.fetch( c( "CHROM" , "POS" , "AN" , "DP" ) , "file=CEU reg=chr7" ) head(d)
      POS  AN   DP CHROM
1 2257048 172 4763  chr7
2 5993056 162 3050  chr7
3 5993133 166 4094  chr7
4 5993234 164 3312  chr7
5 5993301 100 1872  chr7
6 5993468 112 2108  chr7

By default, meta.fetch() looks for the tag at the population variant level, then at the consensus sample-variant level. Thus, in a multi-sample context sample-specific tags will not be available, e.g. DP and AN are specific to a sample, and so are set to NA here when not restricting output to a single file (i.e. we've dropped the file=CEU mask from above):

d <- meta.fetch( c( "CHROM" , "POS" , "AN" , "DP" ) , "reg=chr7" ) head(d)
      POS AN DP CHROM
1 2256112 NA NA  chr7
2 2256163 NA NA  chr7
3 2257048 NA NA  chr7
4 5993056 NA NA  chr7
5 5993133 NA NA  chr7
6 5993234 NA NA  chr7

Applying functions to a dataset

It will often not be desirable or possible to extract an entire (exome-wide, or genome-wide) dataset fully into memory within R, using the var.fetch command. In this case, the var.iterate command can be used to apply a user-defined R function to each variant (or group of variants), with filters specified by a mask, using the same syntax as pseq.

For example, to consider a trivial example in which we count the number of transitions versus transversions: we first define global variables in R to keep track of the counts:

ti <<- tv <<- 0

and we define a function, that expects a variant-list object as described above (ignoring possible multi-allelic and monomorphic variants for this simple example). Using only slightly convoluted R code:

f_titv <- function( v ) { ifelse( sum( sort(c(v$CON$ALT,v$CON$REF)) == c("A","G") ) == 2 | sum( sort(c(v$CON$ALT,v$CON$REF)) == c("C","T") ) == 2 , ti <<- ti+1 , tv <<- tv+1 ) }

and then apply this function to the dataset:

var.iterate( f_titv )
> ti / tv
[1] 3.521341

As with the other commands described above, masks can be included as an argument to var.iterate():

var.iterate( f_titv , "reg=chr22" )

Note that performing these types of operations in R is typically a bit slower than the counterpart in PSEQ. For example, the calculation of the transition/transversion ratio here takes double the amount of time (about 2 seconds) compared to PSEQ's v-stats command, which does more and runs in under 1 second on the same machine, for this very small dataset. Obviously one wants to consider this fact when scaling to larger datasets; we suggest that you prototype methods on subsets of the data first (rather than waiting hours or days only to find something doesn't work as expected...)

Convenience functions

A number of functions in thsi package are simple convenience functions for working with variant lists as returned by var.fetch():

Name Arguments Description
x.core() variant-list, label Extract core fields from variant list (e.g. CHR, BP1)
x.meta() variant-list, label Extract variant (population) meta tags from variant list
x.consensus.core() variant-list, label Extract core fields from consensus variant list (e.g. CHR, BP1)
x.consensus.meta() variant-list, label Extract meta-fields from consensus variant list
x.sample.meta() variant-list, label Extract matrix of consensus and sample-specific meta-fields
x.consensus.genotype() variant-list Extract matrix of genotype from consensus variant (non-reference allele counts)
x.consensus.genotype() variant-list , label Extract matrix of per-individual/per-variant meta-information from consensus variant (non-reference allele counts)

Although the representation of data as nested, arbitrary lists is flexible, often one needs to extract simple vectors or matrices for downstream analytic procedures. For example, to obtain a vector of the physical position of each variant in the list of variants: for example

k <- var.fetch( "loc.req=refseq reg=chr6" ) d <- x.core( k , "BP1" ) str(d)
 int [1:302] 346464 346509 346527 346641 346644 352553 3022140 3058491 3672810 3682360 ...

To extract the the GP variant tag value for each variant from each sample:

d <- x.meta( k , "GP" ) str(d)
 chr [1:302] "6:401464" "6:401509" "6:401527" "6:401641" ...

To extract a matrix of the DP tag for the consensus and sample-specific variants:

d <- x.sample.meta( k , "DP" ) head(d)
> head(d) 
  CON   S1   S2
1  NA 6476 3988
2  NA 8152 6847
3  NA   NA 7694
4  NA 6178 3777
5  NA   NA 3491
6  NA 5925   NA

where CON means consensus, S1 and S2 are the two samples.

To look at genotype level data as a variant by individual matrix:

d <- x.consensus.genotype( k ) str( d )
 int [1:302, 1:156] 0 0 NA 0 NA 0 2 0 2 NA ...

As some simple examples of further analysis: a histogram of alternate allele frequency:

hist( apply( d , 1 , mean , na.rm = T ) / 2 )

Relating genotype to phenotype for a particualr SNP (number 139) in this case:

p <- ind.fetch( "phe1" ) t.test( d[ 139 , ] ~ p$phe1 )

To extract a matrix of read-depth per individual (the DP tag):

x.consensus.genotype( k , "DP" )

Over time more functions will be added to assist working with genetic data in this context. Overall, the focus for the R interface will be a) to prototype new methods, b) to provide an interactive framework for exploring variation in a relatively constrained genomic context (e.g. a single gene). We expect the core analysis functions will continue to reside in PSEQ.