PLINK: Whole genome data analysis toolset plink...
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

Introduction | Basics | Download | Reference | Formats | Data management | Summary stats | Filters | Stratification | IBS/IBD | Association | Family-based | Permutation | LD calcualtions | Haplotypes | Conditional tests | Proxy association | Imputation | Dosage data | Meta-analysis | Result annotation | Clumping | Gene Report | Epistasis | Rare CNVs | Common CNPs | R-plugins | SNP annotation | Simulation | Profiles | ID helper | Resources | Flow chart | Misc. | FAQ | gPLINK

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. Basic usage/data formats 6. Data management

7. Summary stats 8. Inclusion thresholds 9. Population stratification 10. IBS/IBD estimation 11. Association 12. Family-based association 13. Permutation procedures 14. LD calculations 15. Multimarker tests 16. Conditional haplotype tests 17. Proxy association 18. Imputation (beta) 19. Dosage data 20. Meta-analysis 21. Annotation 22. LD-based results clumping 23. Gene-based report 24. Epistasis 25. Rare CNVs 26. Common CNPs 27. R-plugins 28. Annotation web-lookup 29. Simulation tools 30. Profile scoring 31. ID helper 32. Resources 33. Flow-chart 34. Miscellaneous 35. FAQ & Hints

36. gPLINK
 

R plugin functions

This page describes PLINK's limited support for R-based 'plug-in' functions. In this manner, users can extend the basic functionality of PLINK to better meet their own needs.

R is a powerful, freely-available package for statistical computing. PLINK uses the Rserve package to communicate with R. There are some notes on installing and running the Rserve package below.

The idea is that some analyses, such as survival analysis for example, are already implemented in R but not available in PLINK. Having a simple interface for accessing such R functionality, allows one to benefit from both the data-handling features of PLINK (i.e. it being designed specifically to handle large SNP datasets efficiently, in a way that the basic R package is not) as well as the ever-increasing library of statistical tools in R. Also, this should provide an easy way to prototype new methods, etc.

Currently there is only support for SNP-based analyses. As of version 1.05, multiple values can be returned for each SNP, as defined by the user. Potentially (if there is interest/specific suggestions) these features will be expanded to allow other units of analysis and broader communcation with R.

Note Currently, there is only support for R-plugins for Linux-based and Mac OS PLINK distributions.

Note Version 1.04 onwards of PLINK has updated the client code to support the latest version of Rserve. You should re-install Rserve (see notes below) to make sure you have the latest version.

Basic usage for R plug-ins

Assuming Rserve has been installed and is running locally (see below) and that the file myscript.R contains the R code conforming to the standard for a PLINK plug-in (see here), then the command is simply

plink --file mydata --R myscript.R

which generates a file
     plink.auto.R
This file contains the raw output for each SNP, which is whatever vector of numeric values the user returned from their script, and some details about the SNP. There is no header row; each row has the following fields.
     Chromosome position
     SNP id
     Physical position (base-pair)
     Minor allele (A1)
     First return value from R-plugin
     Second return value from R-plugin
     ...
Depending on how you set up the R script, each row may or may not have the same number of columns. Currently it is not possible to return strings of other R objects.

If Rserve is running on anything other than the default port, you can specify an alternate port number by adding
     --R-port 8221
for example.

Defining the R plug-in function

PLINK expects a function in the exact form
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
to be defined in the supplied file. This function is expected to return a numeric vector, with as many elements are there are SNPs. Internally, PLINK will call the Rplink function -- it must be written exactly as shown here. The objects refer to:
     PHENO     vector of phenotypes (n)
     GENO      matrix of genotypes (n x l)
     CLUSTER   vector of cluster membership codes (n)
     COVAR     matrix of covariates (n x c)
where n is the number of individuals (after pruning, filtering, etc) and c is the number of covariates (if any). PLINK generates these objects internally, so the user can assume these exist for when the Rplink() function is called. (In practice, the number of SNPs, l will probably be smaller than the total number of SNPs in the file, as PLINK passes the genotype data into R in batches rather than all in one go).

Genotypes are coded 0, 1 or 2 copies of the minor allele, and NA, as per the --recodeA option.

For each SNP, PLINK expects the function to return a numeric vector of values. This need not have the same number of values for each SNP (although this will make subsequently parsing of the output file harder, potentionally). If the desired return vector is r, then the actual return vector must be
     c( length(r) , r )
That is, PLINK expects back a long string of values, where it reads how many values to read for the first SNP, reads them, then reads how many values to read for the second SNP, reads them, etc. By also using the abve formulation to specify the return vector, PLINK will be able to parse the output.

An example R plug-in is shown here -- this is probably the most straightforward template for an R-plugin, in which the apply() function is used to iteratively call the nested function (f1()), once per SNP, in this case. For example, the file myscript.R might contain the following plug-in:
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{

 f1 <- function(x)   
 { 
    r <- mean(x, na.rm=T) / 2 
    c( length(r) , r )  
 }

 as.numeric( apply(GENO, 2 , f1) )
}
If you are not familiar with the R language, there are a number of excellent resources available from the main R webpage.

Within the body of the main Rplink() function, there are no constraints on what you can do, as long as the return value is in the proper format, as described above. In this example, within the main body of the Rplink() function we first define a function that will be applied to each SNP, called f1(). Unlike the Rplink() function, you can call this whatever you want, or have as many functions as you want. The function f1() calculates the allele frequency for each SNP (as the genotypes are coded as the count of the minor allele, 0,1,2). The second line applies this function to each column of the genotype data, using the apply( data , row/col , function ) command.

Another, perhaps more useful, example is implementing survival analysis within PLINK: here we define a function, f1() to return the p-value for the first coefficient; we assume here that a censoring variable was loaded into PLINK as the first covariate (i.e. the R Surv function takes two parameters, the survival time and censoring status). (This is probably not the optimal way to implement this analysis, but is intended purely as an example of what can be done.)
     library(survival)
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
     {
      f1 <- function(s) 
      {
        m <- summary( coxph( Surv( PHENO , COVAR[,1] ) ~ s ) )
        r <- c( m$coef , m$loglik, m$n )
        c( length(r) , r )
      }
     apply( GENO , 2 , f1 )
     }
In other words, the general format is
     { load any libraries or auxiliary data from a file first }
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
     {
      f1 <- function(x)
      {
       { do something to generate per-SNP return vector r }
       c( length(r) , r )
      }
     apply( GENO , 2 , f1 )
     }

Example of debugging an R plug-in

To generate a text file that contains the R commands PLINK would have run (rather than actually trying to run them -- this is useful for debugging purposes), add the following flag

plink --file mydata --R myscript.R --R-debug

To illustrate the debug function, consider this example, in which we try to implement a logistic regression. The file
     mylog.R
which contains the function
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
     {
       f1 <- function(s) 
       {
        m <- glm( PHENO ~ s , family="binomial" )
        r <- summary(m)$coef[8]
        c( length(r) , r )
       }
      apply( GENO , 2 , f1 )
     }
and we have a dataset with three SNPs; the internal PLINK logistic regression command
plink --file mydata --logistic

yields
     CHR  SNP         BP   A1       TEST    NMISS       ODDS         STAT            P
       1 snp0      10000    A        ADD      200      1.256         1.15       0.2501
       1 snp1      10001    B        ADD      200     0.9028      -0.5112       0.6092
       1 snp2      10002    B        ADD      200     0.6085       -2.242      0.02499
Trying to run the R implementation:
plink --file mydata --R mylog.R

we obtain a set of invalid p-values in plink.auto.R
     1 snp0 10000  A  NA
     1 snp1 10001  B  NA
     1 snp2 10002  B  NA
To find out what is happening, we will run the same command with the debug option
plink --file mydata --R mylog.R --R-debug

This writes to the file plink.auto.R the actual commands that would be passed to R, including the data and the function:
     n <- 200
     
     PHENO <- c( 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2,
     1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2,
     2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1,
     2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1,
     2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1,
     1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1,
     1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1,
     1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1,
     2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1 )
     
     COVAR <- matrix( NA , nrow = n , ncol = 0 , byrow = T)
     
     CLUSTER <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ) 
     
     l <- 3 
     
     g <- c( 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 1, 1, 0, 1, 2, 1, 1, 1,
     2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
     1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 2, 2, 0, 1, 1, 2, 0, 1,
     1, 1, 2, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 1, 1,
     1, 1, 2, 2, 2, 1, 0, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
     0, 2, 2, 1, 0, 0, 0, 1, 0, 1, 2, 2, 2, 1, 0, 0, 0, 2, 1, 2, 2, 1, 1,
     1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 2, 2, 1, 2, 0, 1,
     1, 1, 1, 2, 1, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1,
     2, 2, 1, 2, 1, 1, 2, 2, 0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 1, 2, 2, 2, 0,
     1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0,
     0, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1, 0, 0, 2, 1, 1, 1, 2, 0, 1, 1,
     1, 1, 1, 2, 1, 2, 0, 1, 1, 0, 1, 0, 2, 1, 0, 2, 1, 2, 2, 0, 0, 0, 1,
     1, 2, 1, 1, 1, 0, 2, 1, 0, 2, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1,
     2, 2, 2, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 0, 0, 2, 0, 2, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 0, 1, 1, 0, 2, 1, 1, 1, 2,
     2, 2, 1, 1, 0, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 0,
     0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1,
     0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2, 1, 2, 1, 1, 1, 0, 1, 1,
     1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2,
     2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, 1, 1, 0,
     1, 0, 1, 1, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 0, 2, 0, 2, 0, 0, 1, 1,
     1, 1, 2, 1, 0, 2, 0, 1, 1, 0, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 1, 2, 0,
     0, 1, 0, 1, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 2, 2, 2, 2, 1,
     0, 2, 1, 2, 1, 1, 1, 1, 2, 0, 1, 1, 1, 2, 2, 1, 0, 1, 1, 2, 1, 1, 0,
     1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 1, 2, 1,
     0, 2, 0, 0, 1, 0, 2, 1, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2,
     0, 1, 2, 1 ) 
     
     GENO <- matrix( g , nrow = n ,byrow=T) 
     GENO[GENO == -1 ] <- NA
          
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
     {
     f1 <- function(s) {
        m <- glm( PHENO-1 ~ s , family="binomial" )
        r <- summary(m)$coef[8]
        c( length(r), r)
      }
     apply( GENO , 2 , f1 )
     }
In R, load this function
source("plink.auto.R")

and then try to run the Rplink function
Rplink(PHENO,GENO,CLUSTER,COVAR)

and you will see the error message
     Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
which indicates that R is expecting a 0/1 coding for this particular function, not the default 1/2 coding used by PLINK for the phenotype/dependent variable. You might therefore want to change the relevant line of the function from
     m <- glm( PHENO ~ s , family="binomial" )
to
     m <- glm( PHENO==2 ~ s , family="binomial" )
for example. Then, repeating the above debug procedure, you would see in R
Rplink(PHENO,GENO,CLUSTER,COVAR)

gives
     [1] 0.25013412 0.60921037 0.02499268
which are the correct p-values. So, now the function is fixed running
plink --file mydata --R mylog.R

would generate the same set of p-values as the PLINK logistic command, in plink.auto.R
     1 snp0 10000  A  0.250134
     1 snp1 10001  B  0.60921
     1 snp2 10002  B  0.0249927
This basic function could then be extended to return the coefficients also, or to use different analytic approaches available in R.

Setting up the Rserve package

First, you must ensure that you have Rserve installed on your system. Normally, this will involve just typing, at the R command prompt (not the system shell prompt)
install.packages("Rserve")

HINT For this to work, R must have been configured with --enable-R-shlib.

When using any R-based PLINK plug-in, Rserve must be running in the background before invoking the PLINK command. To start Rserve, just type at the shell prompt
R CMD Rserve

(note, you may need to change Rserve to the full path of where Rserve was installed), or, within R, type at the R prompt
library(Rserve)
Rserve()

Please see the Rserve documentation for further support.
 
This document last modified Wednesday, 25-Jan-2017 11:39:28 EST