## PLINK/SEQ and GWAS data

This page describes some basic ways in which the PLINK/SEQ library can be used to analyse GWAS datasets. In comparison to the PLINK package, advantages include:

- the ability to handle larger datasets (e.g. for millions of SNPs directly genotyped)
- a more flexible analysis of post-imputation, probabilistically-called genotypes
- that it is much easier to extend functionality via R

Disadvantages are that it:

- is potentially quite a bit slower than PLINK
- does not currently contain a similarly large range of GWAS functions

In other words, if PLINK or a similar package works for your GWAS data, use it. Otherwise, you might want to consider using the approaches described below

On this page, we 1) introduce the data, 2) show
a basic GWAS using PLINK, 3) repeat that analysis
in PSEQ and 4), in R using the
PLINK/SEQ library, and finally, 5) illustrate how *soft-called* genotype data can be analysed.

### The data

We will use the data from the PLINK resources
page. Specifically, download this
ZIP archive and extract the files of interest (**bolded**) as
follows:

` unzip example.zip `

Archive: example.zip creating: example/inflating: example/wgas1.mapinflating: example/plink.exe inflating: example/extra.map inflating: example/pop.cov inflating: example/Haploview.jarinflating: example/wgas1.pedinflating: example/extra.ped inflating: example/gPLINK.jar inflating: example/command-list.txt

We can convert this PED/MAP file to a PLINK binary PED (BED) file-format (so that we can subsequently load it into a PLINK/SEQ project), by use of the following PLINK command:

` plink --file example/wgas1 --make-bed --out example/wgas2 `

Some key lines from the LOG file are shown below:

228694 (of 228694) markers to be included from [ example/wgas1.map ] 90 individuals read from [ example/wgas1.ped ] Assuming a disease phenotype (1=unaff, 2=aff, 0=miss) ... 49 cases, 41 controls and 0 missing 45 males, 45 females, and 0 of unspecified sex ... Writing pedigree information to [ example/wgas2.fam ] Writing map (extended format) information to [ example/wgas2.bim ] Writing genotype bitfile to [ example/wgas2.bed ]

Now that we've created a PLINK binary fileset, we will perform some of the steps of a basic GWAS analysis first using PLINK and then PLINK/SEQ.

### Basic GWAS analysis using PLINK

Here, we will perform only two steps: first, to generate MDS components:

```
plink --bfile example/wgas2
--mds-plot 2
--out example/mds1
```

Second, to perform a logistic regression of disease on SNP genotype, covarying for the MDS components:

```
plink --bfile example/wgas2
--logistic
--covar example/mds1.mds
--covar-name C1-C2
--out example/assoc1
```

By way of illustrating some of the top results, we note that only
two SNPs are significant at a *p*-value threshold of 1e-4:

```
awk ' NR == 1 || ( $5 == "ADD" && $9 < 1e-4 ) '
example/assoc1.assoc.logistic
```

CHR SNP BP A1 TEST NMISS OR STAT P 8 rs11204005 12895576 A ADD 90 0.0601 -4.217 2.48e-05 8 rs2460338 12914531 G ADD 89 0.06723 -4.179 2.93e-05

The first the SNPs have the following results:

`grep -w ADD example/assoc1.assoc.logistic | head -3 `

CHR SNP BP A1 TEST NMISS OR STAT P 1 rs3094315 792429 G ADD 89 2.383 1.253 0.2102 1 rs6672353 817376 A ADD 89 3e+08 0.00081 0.9994 1 rs4040617 819185 G ADD 90 2.234 1.135 0.2564

### Basic GWAS analysis using PSEQ

We can now create a PLINK/SEQ project and load the binary-PED file into it. First, we must create a simple project. In this particular case, we do not link to any existing reference datasets and so the syntax is simply:

` pseq proj new-project `

Next, we load the GWAS data into the project, using the
`load-plink` command:

` pseq proj load-plink --file example/wgas2 --id gwas `

expecting 228694 variants on 90 individuals(loading...)inserted 90 individuals, 228694 variants

By default, the phenotype in the FAM file (that corresponds to the
sixth column in the original PED file) is called `phe1` and
automatically stored in the PLINK/SEQ project's INDDB.

We must also load the covariates (the two MDS components) into the
project, conforming to the format described here. An appropriately-formatted file can
be downloaded from this link: mds1.phe.
Alternatively, it can be generated with the following `awk`
command:

```
awk ' BEGIN { printf "##mds1,Float,X,\"MDS component 1\"\n"; \
printf "##mds2,Float,X,\"MDS component 2\"\n"; \
printf "#ID\tmds1\tmds2\n" } \
NR>1 { printf $1"\t"$4"\t"$5"\n" } ' example/mds1.mds \
> example/mds1.phe
```

To load this covariate file into the INDDB:

` pseq proj load-pheno --file example/mds1.phe `

Processed 90 rows

To run a logistic regression of disease on allele count and
covariates for each SNP, use the `glm` command. This will adopt
a logistic or linear regression, depending on the nature of the
phenotype specified.

` pseq proj glm --phenotype phe1 --covar mds1 mds2 `

The ouput from this command lists the variant position and ID, the
reference and alternate alleles, the number of non-missing individuals
included in analysis, the alternate allele frequency F (for binary
traits with no covariates, F_A and F_U represent the case and controls
frequencies), then the odds ratio (defined with respect to the
alternate allele, e.g. for the first variant below, *A* is
protective), its standard error (on log scale), test statistic and
p-value.

VAR REF ALT N F OR SE STAT P chr1:792429:rs3094315 G A 89 0.876 0.419 0.693 -1.253 0.2102 chr1:817376:rs6672353 A G 89 0.994 2e-09 24378 -0.0008 0.9994 chr1:819185:rs4040617 G A 90 0.883 0.447 0.708 -1.1350 0.2564 ...

Note that in this example the designation of *reference* and
*alternate* are arbitrary, reflecting the annotation in the
GWAS dataset of *A1* and *A2*, rather than with respect
to any known reference genome sequence. As a consequence, the odds
ratios will not necessarily be for the same allele in PSEQ as in
PLINK, which only has the concept of the minor allele, rather than
reference and non-reference.

Examining the output, it has reproduced the basic logistic regression results seen above: e.g. the same two SNPs ranked highest:

VAR REF ALT N F OR SE STAT P chr8:12895576:rs11204005 A G 90 0.522 16.638 0.667 4.217 2.5e-05 chr8:12914531:rs2460338 G C 89 0.522 14.874 0.646 4.179 2.9e-05

In general, we see as expected the identical set of results being generated by PSEQ as PLINK for the analysis of GWAS data by logistic regression with two covariates.

### Basic GWAS analysis using the R interface to PLINK/SEQ

As discussed here, PLINK/SEQ projects can be accessed via R. In R,

```
library(Rplinkseq)
pseq.project("proj")
```

This assumes that `proj` is in the working directory. We
will use the `var.iterate()` function to obtain for each marker
the odds ratio, standard error and p-value from a logistic regression
of disease on allele count, controlling for the two MDS components --
that is, the same analysis as above.

#### Phenotype and covariate information

The `ind.fetch()` function can be used to extract phenotype
information in the INDDB, for example:

```
p <- ind.fetch( c("phe1","mds1","mds2") )
str(p)
```

List of 4 $ ID : chr [1:90] "CH18526" "CH18524" "CH18529" "CH18558" ... $ phe1: int [1:90] 1 1 1 1 1 1 1 2 2 1 ... $ mds1: num [1:90] -0.021 -0.0234 -0.0215 -0.0172 -0.0215 ... $ mds2: num [1:90] 0.0054 0.0138 0.0175 0.0206 -0.0108 ...

The full form of the `ind.fetch()` command is:

ind.fetch( pheno = { (vector of) phenotype name(s) in INDDB } , mask = { mask specification string } )

We can confirm that the total number of cases (`phe1=2`) and
controls (`=1`) is the same as in the original dataset.

` table( p$phe1 ) `

1 2 41 49

We can also confirm that the MDS components look as expected (in this dataset, we expect two clusters, corresponding to Chinese and Japanese HapMap samples):

` plot( p$mds1 , p$mds2 ) `

#### Defining the analysis function to apply

We need to define a function that takes a variant list object (the
format of which is described here). To
demonstrate the development of this, we'll use the `var.fetch()`
function to get data on the first variant in the project.

` k <- var.fetch( limit = 1 ) `

This returns a list object:

` str( k ) `

List of 1 $ :List of 7 ..$ CHR: int 1 ..$ BP1: int 792429 ..$ BP2: int 792429 ..$ ID : chr "rs3094315" ..$ NS : int 1 ..$ CON:List of 8 .. ..$ IDX : int 0 .. ..$ FSET : int 0 .. ..$ REF : chr "G" .. ..$ ALT : chr "A" .. ..$ QUAL : num -1 .. ..$ FILTER: chr "" .. ..$ META : list() .. ..$ GENO :List of 1 .. .. ..$GT: int [1:90] 2 2 1 2 2 2 1 1 2 2 ...

In this case, the `CON` element (the *consensus*
sample-variant) contains an element called `GENO`, which
contains genotype information (and any meta-information on genotypes
too). In this example, the `GENO` sub-element contains counts
of the number of *alternate* alleles for each individual
(recall that designation as *reference* and *alternate*
is arbitrary in this particular GWAS-centric example). Thus, the
element `CON$GENO$GT` effectively contains the genotype calls for
all 90 individuals:

` table( k[[1]]$CON$GENO$GT , useNA="always" ) `

1 2 <NA> 22 67 1

In other words, given `ALT` is *A* and `REF`
is *G* in this example, this implies 22 *A/G*
heterozygotes, 67 *A/A* homozygotes and 1 individual with a
missing genotype. So *G* is the minor allele in this example
(although it happens to be the "reference" here).

To perform a logistic regresion for this variant, we can use the `glm()`
function:

```
summary( glm(
p$phe1 == 2 ~ k[[1]]$CON$GENO$GT + p$mds1 + p$mds2 ,
family="binomial" ) )
```

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.8436 1.2767 1.444 0.149 k[[1]]$CON$GENO$GT -0.8682 0.6929 -1.253 0.210 p$mds1 70.7133 13.6709 5.173 2.31e-07 *** p$mds2 -14.1353 17.0560 -0.829 0.407

Here we see that for this first SNP, rs3094315, the association analysis yields similar results, as expected, to the PLINK and PSEQ analyses: an odds ratio of 2.383 (i.e. 1/exp(-0.8682)) for the minor allele, a standard error of 0.693 and a p-value of 0.210.

To apply such an analysis to all variants in a dataset (optionally,
after applying various masks), we do not
need to load all variants into memory. Rather, here we will define a
function to perform the analysis, in this instance that is performed a
variant at a time. We will then use the `var.iterate()` function, which
is similar in spirit to the family of `apply()` functions in R.

Note that `var.iterate()` does not currently return
anything. Rather, it is the job of the user's function to collate any
results of interest in one or more global objects. Depending on the
nature of the data and the analysis, one may want to adopt different
approaches to collecting results (i.e. one could directly write to a
file with the `sink()` command, or use approaches other than
`rbind()` which is unlikely to be very efficient).

```
f <- function(v)
{
t <- summary( glm(
p$phe1 == 2 ~ v$CON$GENO$GT + p$mds1 + p$mds2 ,
family="binomial" )
)$coef[c(2,6,14)] ;
r <<- rbind( r , as.numeric(t) )
}
```

We generate an object that will be used to collect the results:

`r <- numeric()`

In other words, we expect a variant object (called `v`
here), perform the regression described above, and then use a
short-cut to return the three quantities of interest (coefficient, SE
and p-value, assuming there are 3 predictor variables). Applying this
function to the first SNP:

```
f( k[[1]] )
r
```

[1] -0.8682405 0.6928636 0.2101624

We now test whether or not the function works, applying it to only the first few SNPs (first clearing the results matrix):

```
r <- numeric()
var.iterate( f , "limit=10" )
```

We can then observe these results for the first 10 SNPs, in the 10-by-3 matrix that has been populated:

` r `

[,1] [,2] [,3] [1,] -0.8682405 0.6928636 0.2101624 [2,] -14.0002760 1455.3976471 0.9923248 [3,] -0.8038097 0.7081790 0.2563596 [4,] 71.2993081 16.5166310 NA [5,] 71.2993081 16.5166310 NA [6,] 1.0899151 0.8817309 0.2164182 [7,] 0.2318913 0.3687724 0.5294672 [8,] 70.7493405 16.5320981 NA [9,] -0.1884215 1.1037734 0.8644544 [10,] -0.5227010 0.3989326 0.1901116

To apply this function to all variants,

```
r <- numeric()
var.iterate( f )
```

We now have results for all SNPs in the `r` matrix. We can plot the p-values as follows:

` plot( -log10( r[,3] ) ) `

We can also compare these p-values to those generated by PLINK:

```
d <- read.table("example/assoc1.assoc.logistic" , header=T)
d <- d[ d$TEST == "ADD" , ]
plot( r[,3] , d$P )
```

As expected, we can confirm that similar results have been produced by both analysis pipelines. Although it is more cumbersome to use R for this particular analysis, it is easier to use this framework to extend to more complex analyses that aren't necessarily implemented in a genetics package.

### Comparison across approaches: speed and filesize

To give a sense of how these different tools perform on this particular task, for this particular struture of dataset:

Tool | Size of dataset | Time: MAF | Time: logistic regression |
---|---|---|---|

PLINK, PED file | 80M | 14.6s | 1m15s |

PLINK, BED file | 12.5M | 3.7s | 50.8s |

PSEQ | 33M | 14.1s | 1m33s |

R and PLINK/SEQ | 33M | 31s | 1hr12m |

**Note**: the MAF column means that we simply
calculate the MAF for every variant; the logistic function is the
analysis described above, including two covariates, for every variant
in this example dataset.

Some limited conclusions that can be drawn from this particular
comparison: 1) as noted above, PSEQ is generally a little slower than
PLINK for basic GWAS analysis, i.e. the price paid for greater
scalability and extensibility; 2) the logistic regression analysis in
R is particuarly slow, although this reflects the particular
`glm()` function. That is, in this particular example, the
price paid for getting data from a PLINK/SEQ project into R, *per
se* is about the same as the cost of initially reading the data
into PSEQ -- i.e. the simple MAF function in R only takes twice as
long as PSEQ. As mentioned above, for certain purposes,
e.g. prototyping new methods, this will not be an issue.

### Analysis of imputed GWAS data

In this example, we've directly loaded basic SNP data from a PLINK
binary-PED fileset. The `glm` command can be used with any
source of data, however, including VCFs. To base the PSEQ `glm` tests on
soft-called genotypes instead of the hard-calls in the `GT` field, you can add
one of the following options:

` --options postprobs=PP `

to indicate the presence of a tag in the `FORMAT` fields
called `PP` that contains genotype posterior probabilities;
alternatively,

` --options dosages=EC `

which can be set to point to dosages (expected counts) in the VCF to be used instead, e.g.

GT:EC 0/1:1.023 ./.:0.234 0/0:0.001

**Note:** These two options not available in current release.