PLINK/SEQ and GWAS data

This page describes some basic ways in which the PLINK/SEQ library can be used to analyse GWAS datasets. In comparison to the PLINK package, advantages include:

  • the ability to handle larger datasets (e.g. for millions of SNPs directly genotyped)
  • a more flexible analysis of post-imputation, probabilistically-called genotypes
  • that it is much easier to extend functionality via R

Disadvantages are that it:

  • is potentially quite a bit slower than PLINK
  • does not currently contain a similarly large range of GWAS functions

In other words, if PLINK or a similar package works for your GWAS data, use it. Otherwise, you might want to consider using the approaches described below

On this page, we 1) introduce the data, 2) show a basic GWAS using PLINK, 3) repeat that analysis in PSEQ and 4), in R using the PLINK/SEQ library, and finally, 5) illustrate how soft-called genotype data can be analysed.

The data

We will use the data from the PLINK resources page. Specifically, download this ZIP archive and extract the files of interest (bolded) as follows:

unzip example.zip
Archive:  example.zip
   creating: example/
  inflating: example/wgas1.map       
  inflating: example/plink.exe       
  inflating: example/extra.map       
  inflating: example/pop.cov         
  inflating: example/Haploview.jar   
  inflating: example/wgas1.ped       
  inflating: example/extra.ped       
  inflating: example/gPLINK.jar      
  inflating: example/command-list.txt  

We can convert this PED/MAP file to a PLINK binary PED (BED) file-format (so that we can subsequently load it into a PLINK/SEQ project), by use of the following PLINK command:

plink --file example/wgas1 --make-bed --out example/wgas2

Some key lines from the LOG file are shown below:

 228694 (of 228694) markers to be included from [ example/wgas1.map ]
 90 individuals read from [ example/wgas1.ped ] 
 Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
 ...
 49 cases, 41 controls and 0 missing
 45 males, 45 females, and 0 of unspecified sex
 ...
 Writing pedigree information to [ example/wgas2.fam ] 
 Writing map (extended format) information to [ example/wgas2.bim ] 
 Writing genotype bitfile to [ example/wgas2.bed ] 

Now that we've created a PLINK binary fileset, we will perform some of the steps of a basic GWAS analysis first using PLINK and then PLINK/SEQ.

Basic GWAS analysis using PLINK

Here, we will perform only two steps: first, to generate MDS components:

plink --bfile example/wgas2 --mds-plot 2 --out example/mds1

Second, to perform a logistic regression of disease on SNP genotype, covarying for the MDS components:

plink --bfile example/wgas2 --logistic --covar example/mds1.mds --covar-name C1-C2 --out example/assoc1

By way of illustrating some of the top results, we note that only two SNPs are significant at a p-value threshold of 1e-4:

awk ' NR == 1 || ( $5 == "ADD" && $9 < 1e-4 ) ' example/assoc1.assoc.logistic
 CHR         SNP         BP   A1   TEST  NMISS       OR     STAT         P 
   8  rs11204005   12895576    A    ADD     90   0.0601   -4.217  2.48e-05
   8   rs2460338   12914531    G    ADD     89  0.06723   -4.179  2.93e-05

The first the SNPs have the following results:

grep -w ADD example/assoc1.assoc.logistic | head -3
 CHR         SNP         BP   A1   TEST  NMISS       OR     STAT         P 
   1   rs3094315     792429    G    ADD     89    2.383    1.253    0.2102
   1   rs6672353     817376    A    ADD     89    3e+08  0.00081    0.9994
   1   rs4040617     819185    G    ADD     90    2.234    1.135    0.2564

Basic GWAS analysis using PSEQ

We can now create a PLINK/SEQ project and load the binary-PED file into it. First, we must create a simple project. In this particular case, we do not link to any existing reference datasets and so the syntax is simply:

pseq proj new-project

Next, we load the GWAS data into the project, using the load-plink command:

pseq proj load-plink --file example/wgas2 --id gwas
 expecting 228694 variants on 90 individuals
 (loading...)
 inserted 90 individuals, 228694 variants

By default, the phenotype in the FAM file (that corresponds to the sixth column in the original PED file) is called phe1 and automatically stored in the PLINK/SEQ project's INDDB.

We must also load the covariates (the two MDS components) into the project, conforming to the format described here. An appropriately-formatted file can be downloaded from this link: mds1.phe. Alternatively, it can be generated with the following awk command:

awk ' BEGIN { printf "##mds1,Float,X,\"MDS component 1\"\n"; \ printf "##mds2,Float,X,\"MDS component 2\"\n"; \ printf "#ID\tmds1\tmds2\n" } \ NR>1 { printf $1"\t"$4"\t"$5"\n" } ' example/mds1.mds \ > example/mds1.phe

To load this covariate file into the INDDB:

pseq proj load-pheno --file example/mds1.phe
 Processed 90 rows

To run a logistic regression of disease on allele count and covariates for each SNP, use the glm command. This will adopt a logistic or linear regression, depending on the nature of the phenotype specified.

pseq proj glm --phenotype phe1 --covar mds1 mds2

The ouput from this command lists the variant position and ID, the reference and alternate alleles, the number of non-missing individuals included in analysis, the alternate allele frequency F (for binary traits with no covariates, F_A and F_U represent the case and controls frequencies), then the odds ratio (defined with respect to the alternate allele, e.g. for the first variant below, A is protective), its standard error (on log scale), test statistic and p-value.

VAR		       REF  ALT	  N	 F     OR     SE     STAT       P
chr1:792429:rs3094315    G    A  89  0.876  0.419  0.693   -1.253  0.2102
chr1:817376:rs6672353    A    G  89  0.994  2e-09  24378  -0.0008  0.9994
chr1:819185:rs4040617    G    A  90  0.883  0.447  0.708  -1.1350  0.2564
...

Note that in this example the designation of reference and alternate are arbitrary, reflecting the annotation in the GWAS dataset of A1 and A2, rather than with respect to any known reference genome sequence. As a consequence, the odds ratios will not necessarily be for the same allele in PSEQ as in PLINK, which only has the concept of the minor allele, rather than reference and non-reference.

Examining the output, it has reproduced the basic logistic regression results seen above: e.g. the same two SNPs ranked highest:

VAR                       REF  ALT   N      F     OR    SE  STAT       P
chr8:12895576:rs11204005    A    G  90  0.522 16.638 0.667 4.217 2.5e-05
chr8:12914531:rs2460338     G    C  89  0.522 14.874 0.646 4.179 2.9e-05

In general, we see as expected the identical set of results being generated by PSEQ as PLINK for the analysis of GWAS data by logistic regression with two covariates.

Basic GWAS analysis using the R interface to PLINK/SEQ

As discussed here, PLINK/SEQ projects can be accessed via R. In R,

library(Rplinkseq) pseq.project("proj")

This assumes that proj is in the working directory. We will use the var.iterate() function to obtain for each marker the odds ratio, standard error and p-value from a logistic regression of disease on allele count, controlling for the two MDS components -- that is, the same analysis as above.

Phenotype and covariate information

The ind.fetch() function can be used to extract phenotype information in the INDDB, for example:

p <- ind.fetch( c("phe1","mds1","mds2") ) str(p)
List of 4
 $ ID  : chr [1:90] "CH18526" "CH18524" "CH18529" "CH18558" ...
 $ phe1: int [1:90] 1 1 1 1 1 1 1 2 2 1 ...
 $ mds1: num [1:90] -0.021 -0.0234 -0.0215 -0.0172 -0.0215 ...
 $ mds2: num [1:90] 0.0054 0.0138 0.0175 0.0206 -0.0108 ...

The full form of the ind.fetch() command is:

 ind.fetch( pheno = { (vector of) phenotype name(s) in INDDB } , 
            mask  = { mask specification string } )

We can confirm that the total number of cases (phe1=2) and controls (=1) is the same as in the original dataset.

table( p$phe1 )
 1  2 
41 49 

We can also confirm that the MDS components look as expected (in this dataset, we expect two clusters, corresponding to Chinese and Japanese HapMap samples):

plot( p$mds1 , p$mds2 )

Defining the analysis function to apply

We need to define a function that takes a variant list object (the format of which is described here). To demonstrate the development of this, we'll use the var.fetch() function to get data on the first variant in the project.

k <- var.fetch( limit = 1 )

This returns a list object:

str( k )
List of 1
 $ :List of 7
  ..$ CHR: int 1
  ..$ BP1: int 792429
  ..$ BP2: int 792429
  ..$ ID : chr "rs3094315"
  ..$ NS : int 1
  ..$ CON:List of 8
  .. ..$ IDX   : int 0
  .. ..$ FSET  : int 0
  .. ..$ REF   : chr "G"
  .. ..$ ALT   : chr "A"
  .. ..$ QUAL  : num -1
  .. ..$ FILTER: chr ""
  .. ..$ META  : list()
  .. ..$ GENO  :List of 1
  .. .. ..$ GT: int [1:90] 2 2 1 2 2 2 1 1 2 2 ...

In this case, the CON element (the consensus sample-variant) contains an element called GENO, which contains genotype information (and any meta-information on genotypes too). In this example, the GENO sub-element contains counts of the number of alternate alleles for each individual (recall that designation as reference and alternate is arbitrary in this particular GWAS-centric example). Thus, the element CON$GENO$GT effectively contains the genotype calls for all 90 individuals:

table( k[[1]]$CON$GENO$GT , useNA="always" )
   1    2 <NA> 
  22   67    1 

In other words, given ALT is A and REF is G in this example, this implies 22 A/G heterozygotes, 67 A/A homozygotes and 1 individual with a missing genotype. So G is the minor allele in this example (although it happens to be the "reference" here).

To perform a logistic regresion for this variant, we can use the glm() function:

summary( glm( p$phe1 == 2 ~ k[[1]]$CON$GENO$GT + p$mds1 + p$mds2 , family="binomial" ) )
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)          1.8436     1.2767   1.444    0.149    
k[[1]]$CON$GENO$GT  -0.8682     0.6929  -1.253    0.210    
p$mds1              70.7133    13.6709   5.173 2.31e-07 ***
p$mds2             -14.1353    17.0560  -0.829    0.407    

Here we see that for this first SNP, rs3094315, the association analysis yields similar results, as expected, to the PLINK and PSEQ analyses: an odds ratio of 2.383 (i.e. 1/exp(-0.8682)) for the minor allele, a standard error of 0.693 and a p-value of 0.210.

To apply such an analysis to all variants in a dataset (optionally, after applying various masks), we do not need to load all variants into memory. Rather, here we will define a function to perform the analysis, in this instance that is performed a variant at a time. We will then use the var.iterate() function, which is similar in spirit to the family of apply() functions in R.

Note that var.iterate() does not currently return anything. Rather, it is the job of the user's function to collate any results of interest in one or more global objects. Depending on the nature of the data and the analysis, one may want to adopt different approaches to collecting results (i.e. one could directly write to a file with the sink() command, or use approaches other than rbind() which is unlikely to be very efficient).

f <- function(v) { t <- summary( glm( p$phe1 == 2 ~ v$CON$GENO$GT + p$mds1 + p$mds2 , family="binomial" ) )$coef[c(2,6,14)] ; r <<- rbind( r , as.numeric(t) ) }

We generate an object that will be used to collect the results:

r <- numeric()

In other words, we expect a variant object (called v here), perform the regression described above, and then use a short-cut to return the three quantities of interest (coefficient, SE and p-value, assuming there are 3 predictor variables). Applying this function to the first SNP:

f( k[[1]] ) r
[1] -0.8682405  0.6928636  0.2101624

We now test whether or not the function works, applying it to only the first few SNPs (first clearing the results matrix):

r <- numeric() var.iterate( f , "limit=10" )

We can then observe these results for the first 10 SNPs, in the 10-by-3 matrix that has been populated:

r
             [,1]         [,2]      [,3]
 [1,]  -0.8682405    0.6928636 0.2101624
 [2,] -14.0002760 1455.3976471 0.9923248
 [3,]  -0.8038097    0.7081790 0.2563596
 [4,]  71.2993081   16.5166310        NA
 [5,]  71.2993081   16.5166310        NA
 [6,]   1.0899151    0.8817309 0.2164182
 [7,]   0.2318913    0.3687724 0.5294672
 [8,]  70.7493405   16.5320981        NA
 [9,]  -0.1884215    1.1037734 0.8644544
[10,]  -0.5227010    0.3989326 0.1901116

To apply this function to all variants,

r <- numeric() var.iterate( f )

We now have results for all SNPs in the r matrix. We can plot the p-values as follows:

plot( -log10( r[,3] ) )

We can also compare these p-values to those generated by PLINK:

d <- read.table("example/assoc1.assoc.logistic" , header=T) d <- d[ d$TEST == "ADD" , ] plot( r[,3] , d$P )

As expected, we can confirm that similar results have been produced by both analysis pipelines. Although it is more cumbersome to use R for this particular analysis, it is easier to use this framework to extend to more complex analyses that aren't necessarily implemented in a genetics package.

Comparison across approaches: speed and filesize

To give a sense of how these different tools perform on this particular task, for this particular struture of dataset:

Tool Size of dataset Time: MAF Time: logistic regression
PLINK, PED file 80M 14.6s 1m15s
PLINK, BED file 12.5M 3.7s 50.8s
PSEQ 33M 14.1s 1m33s
R and PLINK/SEQ 33M 31s 1hr12m

Note: the MAF column means that we simply calculate the MAF for every variant; the logistic function is the analysis described above, including two covariates, for every variant in this example dataset.

Some limited conclusions that can be drawn from this particular comparison: 1) as noted above, PSEQ is generally a little slower than PLINK for basic GWAS analysis, i.e. the price paid for greater scalability and extensibility; 2) the logistic regression analysis in R is particuarly slow, although this reflects the particular glm() function. That is, in this particular example, the price paid for getting data from a PLINK/SEQ project into R, per se is about the same as the cost of initially reading the data into PSEQ -- i.e. the simple MAF function in R only takes twice as long as PSEQ. As mentioned above, for certain purposes, e.g. prototyping new methods, this will not be an issue.

Analysis of imputed GWAS data

In this example, we've directly loaded basic SNP data from a PLINK binary-PED fileset. The glm command can be used with any source of data, however, including VCFs. To base the PSEQ glm tests on soft-called genotypes instead of the hard-calls in the GT field, you can add one of the following options:

--options postprobs=PP

to indicate the presence of a tag in the FORMAT fields called PP that contains genotype posterior probabilities; alternatively,

--options dosages=EC

which can be set to point to dosages (expected counts) in the VCF to be used instead, e.g.

  GT:EC  0/1:1.023  ./.:0.234  0/0:0.001 

Note: These two options not available in current release.