The R interface

The PLINK/Seq library can be compiled as an R extension library. This enables users to access PLINK/Seq projects directly via the R interface, and to use the rich set of statistical and visualisation tools available in R. A small set of helper utility functions is provided to assist the exchange of data between PLINK/Seq and R -- these are very much under development however.

R functions for working with PLINK/Seq projects

Variants-lists: representation of variant data in R

Variation data are represented as list objects in R, with the following structure:

 CHR    integer    chromosome code
 BP1    integer    physical position 
 BP2    integer    second physical position (stop)
 ID     text       variant ID
 NS     integer    number of samples / VCFs
 CON    list       consensus sample-variant 
 S1     list       first sample-variant
 S2     list       second sample-variant
 ...

Note -- here sample refers to a collection of individuals, not a single individual. The structure of the sample-variant list objects are as follows:

 IDX    integer    index (internal use only)
 FSET   integer    fileset identifier (0 for consensus)
 REF    text       reference allele
 ALT    text       alternate allele(s)
 QUAL   float      quality score (VCF field)
 FILTER text       filter string (VCF field)
 META   list       list of meta-information fields
 GENO   list       list of genotype data

The format of the META lists will reflect the meta-information associated with that variant: e.g. here is an exert from an actual dataset:

  ..$ META  :List of 6
  .. ..$ AA: chr "C"
  .. ..$ GP: chr "22:17662444"
  .. ..$ AC: int 1
  .. ..$ AN: int 178
  .. ..$ BN: int 127
  .. ..$ DP: int 3798

The format of the GENO lists also represents any per-individual meta-fields, although the first field will always be GT for genotype. For example,

  ..$ GENO  :List of 2
  .. ..$ GT: int [1:156] 1 1 0 1 2 0 0 1 0 2 ...
  .. ..$ DP: int [1:156] 22 18 25 99 126 21 10 38 26 19 ...

This indicates that 156 individuals have data, for a genotype and also for a field called DP (i.e. per individual read-depth). Currently, genotype data are downcoded to this simple non-reference allele count (0,1,2 or NA). Future releases will incorporate ways to access the other genotypic information that can be represented in VCFs and stored within PLINK/Seq (e.g. haplotype phase, ploidy, multi-allelic variants).

Convenience functions

Most of the current functions in pseq.R represent simple convenience functions for working with variant lists. For example, the function

k <- var.fetch( "reg=chr22:17000000..19000000" )

will return all variants on chromosome 22 between 17Mb and 19Mb, as a list of variant-lists. Although the representation of data as nested, arbitrary lists is flexible, often one needs to extract simple vectors or matrices for downstream analytic procedures. For example, to obtain a vector of the physical position of each variant in the list of variants:

x.core( k , "BP1" )
 [1] 16042444 16042793 16049306 17729354 18338811 18338829 18340512 18343258
 [9] 18347480 18348971 18349043 18349075 18349106 18349495 20328280 20328448
[17] 20648354 20648538 20648671 20653003 21833121 21833170 22909503 22910156
[25] 22910807 22910897 22912041 22913230 29153196 29153250 29186050 29186121
[33] 29187329 29187373 29187448 29190830 29194610 29194627 29194655 29196053
...

To extract the read-depth DP tag value for each variant, as DP is a meta-field:

x.consensus.genotype( k )
  str( x.consensus.genotype( k )  ) 
   int [1:86, 1:156] NA 0 1 NA 0 0 0 0 0 2 ...

To extract a matrix of read-depth per individual (the DP tag)

x.consensus.genotype( k , "DP" )

The variant list can be subsetted in the usual way

x.consensus.genotype( k[1..10] , "DP" )

to get only the first 10 variants in the extracted set, for example

Fetching individual and phenotype data

To obtain a list of IDs, and phenotype, after applying a mask, use the command ind.fetch()

to return a vector of IDs,

ind.fetch( "phe1" )

to return a list of two vectors: IDs, and 0s or 1s for a case/control phenotype; finally, a mask can also be given

ind.fetch( "phe1" , "phe=phe:2" )

As with all R functions, the names of the arguments can be included in code for increased readability:

ind.fetch( pheno = "phe1" , mask = "phe=phe:2" )

Applying functions to a dataset

It will often not be desirable or possible to extract an entire (exome-wide, or genome-wide) dataset fully into memory within R, using the var.fetch command. In this case, the var.iterate command can be used to apply a user-defined R function to each variant (or group of variants), with filters specified by a mask, using the same syntax as pseq.

For example, to consider a trivial example in which we count the number of transitions versus transversions: we first define global variables in R to keep track of the counts:

ti <<- tv <<- 0

and we define a function, that expects a variant-list object as described above (ignore possible multi-allelic and monomorphic variants for this simple example). Using only slightly convoluted R code:

f_titv <- function( v ) { ifelse( sum( sort(c(v$CON$ALT,v$CON$REF)) == c("A","G") ) == 2 | sum( sort(c(v$CON$ALT,v$CON$REF)) == c("C","T") ) == 2 , ti <<- ti+1 , tv <<- tv+1 ) }

and then apply this function to the dataset:

var.iterate( f_titv )
> ti / tv
[1] 3.521341

masks can be included, e.g. var.iterate( f_titv , " reg=chr22 " )