1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Meta-analysis
21. Annotation
22. LD-based results clumping
23. Gene-based report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. R-plugins
28. Annotation web-lookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flow-chart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK
|
|
R plugin functions
This page describes PLINK's limited support for R-based 'plug-in'
functions. In this manner, users can extend the basic functionality
of PLINK to better meet their own needs.
R is a powerful, freely-available
package for statistical computing. PLINK uses the Rserve package to communicate
with R. There are some notes on installing and running the Rserve
package below.
The idea is that some analyses, such as survival analysis for example,
are already implemented in R but not available in
PLINK. Having a simple interface for accessing such R functionality,
allows one to benefit from both the data-handling features of PLINK
(i.e. it being designed specifically to handle large SNP datasets
efficiently, in a way that the basic R package is not) as well as the
ever-increasing library of statistical tools in R. Also, this should
provide an easy way to prototype new methods, etc.
Currently there is only support for SNP-based analyses. As of version
1.05, multiple values can be returned for each SNP, as defined by the
user. Potentially (if there is interest/specific suggestions) these
features will be expanded to allow other units of analysis and
broader communcation with R.
Note Currently, there is only support for
R-plugins for Linux-based and Mac OS PLINK distributions.
Note Version 1.04 onwards of PLINK has updated
the client code to support the latest version of Rserve. You should
re-install Rserve (see notes below) to make sure you have the latest
version.
Basic usage for R plug-ins
Assuming Rserve has been installed and is running locally
(see below) and that the file
myscript.R contains the R code conforming to the standard for a
PLINK plug-in (see here), then the command is simply
plink --file mydata --R myscript.R
which generates a file
plink.auto.R
This file contains the raw output for each SNP, which is whatever vector
of numeric values the user returned from their script, and some details about
the SNP. There is no header row; each row has the following fields.
Chromosome position
SNP id
Physical position (base-pair)
Minor allele (A1)
First return value from R-plugin
Second return value from R-plugin
...
Depending on how you set up the R script, each row may or may not have the same
number of columns. Currently it is not possible to return strings of other R
objects.
If Rserve is running on anything other than the default port, you can specify an alternate port number by adding
--R-port 8221
for example.
Defining the R plug-in function
PLINK expects a function in the exact form
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
to be defined in the supplied file. This function is expected to return
a numeric vector, with as many elements are there are SNPs. Internally,
PLINK will call the Rplink function -- it must be written exactly
as shown here. The objects refer to:
PHENO vector of phenotypes (n)
GENO matrix of genotypes (n x l)
CLUSTER vector of cluster membership codes (n)
COVAR matrix of covariates (n x c)
where n is the number of individuals (after pruning, filtering,
etc) and c is the number of covariates (if any). PLINK
generates these objects internally, so the user can assume these exist
for when the Rplink() function is called. (In practice, the
number of SNPs, l will probably be smaller than the total number
of SNPs in the file, as PLINK passes the genotype data into R in batches
rather than all in one go).
Genotypes are coded 0, 1 or 2 copies of the minor allele,
and NA, as per the --recodeA option.
For each SNP, PLINK expects the function to return a numeric vector of values. This need not
have the same number of values for each SNP (although this will make subsequently parsing of the
output file harder, potentionally). If the desired return vector is r, then the actual
return vector must be
c( length(r) , r )
That is, PLINK expects back a long string of values, where it reads how many values to read for the first SNP, reads
them, then reads how many values to read for the second SNP, reads them, etc. By also using the abve formulation to
specify the return vector, PLINK will be able to parse the output.
An example R plug-in is shown here -- this is probably the most
straightforward template for an R-plugin, in which the apply()
function is used to iteratively call the nested function (f1()),
once per SNP, in this case. For example, the file myscript.R
might contain the following plug-in:
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(x)
{
r <- mean(x, na.rm=T) / 2
c( length(r) , r )
}
as.numeric( apply(GENO, 2 , f1) )
}
If you are not familiar with the R language, there are a number of
excellent resources available from the main R webpage.
Within the body of the main Rplink() function, there are no
constraints on what you can do, as long as the return value is in the
proper format, as described above. In this example, within the main
body of the Rplink() function we first define a function
that will be applied to each SNP, called f1(). Unlike
the Rplink() function, you can call this whatever you want,
or have as many functions as you want. The function f1()
calculates the allele frequency for each
SNP (as the genotypes are coded as the count of the minor allele,
0,1,2). The second line applies this function to each column of the
genotype data, using the apply( data , row/col , function
) command.
Another, perhaps more useful, example is implementing survival analysis within PLINK: here
we define a function, f1() to return the p-value for the first coefficient; we assume
here that a censoring variable was loaded into PLINK as the first covariate (i.e. the R Surv
function takes two parameters, the survival time and censoring status). (This is probably not the optimal
way to implement this analysis, but is intended purely as an example of what can be done.)
library(survival)
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(s)
{
m <- summary( coxph( Surv( PHENO , COVAR[,1] ) ~ s ) )
r <- c( m$coef , m$loglik, m$n )
c( length(r) , r )
}
apply( GENO , 2 , f1 )
}
In other words, the general format is
{ load any libraries or auxiliary data from a file first }
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(x)
{
{ do something to generate per-SNP return vector r }
c( length(r) , r )
}
apply( GENO , 2 , f1 )
}
Example of debugging an R plug-in
To generate a text file that contains the R commands PLINK would have
run (rather than actually trying to run them -- this is useful for
debugging purposes), add the following flag
plink --file mydata --R myscript.R --R-debug
To illustrate the debug function, consider this example, in which we
try to implement a logistic regression. The file
mylog.R
which contains the function
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(s)
{
m <- glm( PHENO ~ s , family="binomial" )
r <- summary(m)$coef[8]
c( length(r) , r )
}
apply( GENO , 2 , f1 )
}
and we have a dataset with three SNPs; the internal PLINK logistic regression command
plink --file mydata --logistic
yields
CHR SNP BP A1 TEST NMISS ODDS STAT P
1 snp0 10000 A ADD 200 1.256 1.15 0.2501
1 snp1 10001 B ADD 200 0.9028 -0.5112 0.6092
1 snp2 10002 B ADD 200 0.6085 -2.242 0.02499
Trying to run the R implementation:
plink --file mydata --R mylog.R
we obtain a set of invalid p-values in plink.auto.R
1 snp0 10000 A NA
1 snp1 10001 B NA
1 snp2 10002 B NA
To find out what is happening, we will run the same command with the debug option
plink --file mydata --R mylog.R --R-debug
This writes to the file plink.auto.R the actual commands that would be passed to
R, including the data and the function:
n <- 200
PHENO <- c( 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2,
1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2,
2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1,
2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1,
2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1,
1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1,
1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1,
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1,
2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1 )
COVAR <- matrix( NA , nrow = n , ncol = 0 , byrow = T)
CLUSTER <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
l <- 3
g <- c( 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 1, 1, 0, 1, 2, 1, 1, 1,
2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 2, 2, 0, 1, 1, 2, 0, 1,
1, 1, 2, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 1, 1,
1, 1, 2, 2, 2, 1, 0, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
0, 2, 2, 1, 0, 0, 0, 1, 0, 1, 2, 2, 2, 1, 0, 0, 0, 2, 1, 2, 2, 1, 1,
1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 2, 2, 1, 2, 0, 1,
1, 1, 1, 2, 1, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1,
2, 2, 1, 2, 1, 1, 2, 2, 0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 1, 2, 2, 2, 0,
1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0,
0, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1, 0, 0, 2, 1, 1, 1, 2, 0, 1, 1,
1, 1, 1, 2, 1, 2, 0, 1, 1, 0, 1, 0, 2, 1, 0, 2, 1, 2, 2, 0, 0, 0, 1,
1, 2, 1, 1, 1, 0, 2, 1, 0, 2, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1,
2, 2, 2, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 0, 0, 2, 0, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 0, 1, 1, 0, 2, 1, 1, 1, 2,
2, 2, 1, 1, 0, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 0,
0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1,
0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2, 1, 2, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2,
2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, 1, 1, 0,
1, 0, 1, 1, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 0, 2, 0, 2, 0, 0, 1, 1,
1, 1, 2, 1, 0, 2, 0, 1, 1, 0, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 1, 2, 0,
0, 1, 0, 1, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 2, 2, 2, 2, 1,
0, 2, 1, 2, 1, 1, 1, 1, 2, 0, 1, 1, 1, 2, 2, 1, 0, 1, 1, 2, 1, 1, 0,
1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 1, 2, 1,
0, 2, 0, 0, 1, 0, 2, 1, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2,
0, 1, 2, 1 )
GENO <- matrix( g , nrow = n ,byrow=T)
GENO[GENO == -1 ] <- NA
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(s) {
m <- glm( PHENO-1 ~ s , family="binomial" )
r <- summary(m)$coef[8]
c( length(r), r)
}
apply( GENO , 2 , f1 )
}
In R, load this function
source("plink.auto.R")
and then try to run the Rplink function
Rplink(PHENO,GENO,CLUSTER,COVAR)
and you will see the error message
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
which indicates that R is expecting a 0/1 coding for this
particular function, not the default 1/2 coding used by PLINK for the
phenotype/dependent variable. You might therefore want to change the
relevant line of the function from
m <- glm( PHENO ~ s , family="binomial" )
to
m <- glm( PHENO==2 ~ s , family="binomial" )
for example. Then, repeating the above debug procedure, you would see in R
Rplink(PHENO,GENO,CLUSTER,COVAR)
gives
[1] 0.25013412 0.60921037 0.02499268
which are the correct p-values. So, now the function is fixed running
plink --file mydata --R mylog.R
would generate the same set of p-values as the PLINK
logistic command, in plink.auto.R
1 snp0 10000 A 0.250134
1 snp1 10001 B 0.60921
1 snp2 10002 B 0.0249927
This basic function could then be extended to return the coefficients also, or to use different analytic approaches
available in R.
Setting up the Rserve package
First, you must ensure that you have Rserve installed on your
system. Normally, this will involve just typing, at the R command
prompt (not the system shell prompt)
install.packages("Rserve")
HINT For this to work, R must have been configured with
--enable-R-shlib.
When using any R-based PLINK plug-in, Rserve must be running in the
background before invoking the PLINK command. To start Rserve, just type
at the shell prompt
R CMD Rserve
(note, you may need to change Rserve to the full path of where
Rserve was installed), or, within R, type at the R prompt
library(Rserve)
Rserve()
Please see the Rserve
documentation for further support.
|
|