PLINK: Whole genome data analysis toolset

Rare variant burden tests

This page describes the methods currently in PLINK to evaluate genes or sets of SNPs for association with disease, with a focus on the burden of rare and low-frequency mutations as assayed by resequencing studies.

The primary method, the variable-threshold test, is described in Price et al. (American Journal of Human Genetics, 2010). Please refer to that manuscript for methodological details, and cite it as well as PLINK if you use this test in a publication.

Variable threshold test

This test is appropriate for samples of unrelated individuals, to assess association between multiple rare and low-frequency variants in one or more genes (or, more generically, any set of variants) and a dichotomous or quantitative phenotype.

For each set tested, the threshold for including variants in the test (e.g. all variants below 1% minor allele frequency) is automatically optimised. False positive rates are controlled by a permutation procedure that repeats the same optimisation for each permuted dataset. Weights for each variant (e.g. from PolyPhen2) can be included. Other covariates are not currently supported.
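
The optimise-then-permute logic described above can be sketched in a few lines of Python. This is a simplified illustration, not PLINK's implementation: the function names, the z-like burden score, the use of the observed frequencies as candidate thresholds, and the (hits + 1) / (permutations + 1) form of the empirical p-value are all assumptions made for the example. Genotypes are assumed to be minor allele counts, one row per individual.

```python
import math
import random


def burden_z(genotypes, phenotype, mafs, weights, threshold):
    """Weighted burden score over variants with MAF <= threshold (illustrative)."""
    mean = sum(phenotype) / len(phenotype)
    included = [j for j, f in enumerate(mafs) if f <= threshold]
    num = 0.0
    den = 0.0
    for i, row in enumerate(genotypes):
        for j in included:
            x = weights[j] * row[j]
            num += x * (phenotype[i] - mean)
            den += x * x
    return num / math.sqrt(den) if den > 0 else 0.0


def max_z(genotypes, phenotype, mafs, weights):
    """Best score over all candidate thresholds (here, the observed MAFs)."""
    return max(burden_z(genotypes, phenotype, mafs, weights, t)
               for t in sorted(set(mafs)))


def vt_empirical_p(genotypes, phenotype, mafs, weights, n_perm, seed=1):
    """One-sided empirical p-value; the threshold is re-optimised per permutation."""
    rng = random.Random(seed)
    observed = max_z(genotypes, phenotype, mafs, weights)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        if max_z(genotypes, shuffled, mafs, weights) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the maximum over thresholds is recomputed for every permuted dataset, the extra flexibility of choosing the best threshold is accounted for in the null distribution rather than inflating the false positive rate.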

To perform a variable-threshold burden test, the basic command is --vt-test, along with the number of permutations to be performed:

plink --bfile mydata --vt-test --mperm 10000

As with all PLINK commands, the genotype data can be loaded into PLINK from a variety of formats: in this example we've loaded a binary fileset (--bfile), but we could also use any of the formats described here.

Note: The --mperm command, which specifies the number of permutations to apply, is required for this test.

This analysis produces three output files: the file
     plink.vt     
has the fields (for a quantitative trait):
      SET   Name of the set (or ALL if no sets specified)
     NSNP   Total number of SNPs in the set
       TC   Threshold for inclusion, number of minor alleles
       TF   Threshold for inclusion, as minor allele frequency
    NSNP2   Number of SNPs included in the test (i.e. below threshold)
     CNT1   Number of individuals with at least 1 included rare variant
     CNT0   Number of individuals with no included rare variants
    MEAN1   Phenotypic mean for individuals with 1 or more rare variants
    MEAN0   Phenotypic mean for individuals with no included rare variants

For a case/control outcome, the file instead lists CNTA and CNTU, which are the number of alleles from included variants in cases and controls, respectively.

For each set, the variants included in the test (i.e. passing the frequency threshold) are listed in the file:

     plink.vt.var
which contains the fields
      SET   Name of set
      SNP   Name of each included SNP
      WGT   Weight (if any) for this variant
      CNT   Number of minor alleles observed in sample
        F   Minor allele frequency of this variant
   ATTRIB   Any attributes specified (see below)
Finally, the file
     plink.vt.mperm
contains the fields
     SET    Set name
    EMP1    Empirical p-value for this set
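
As a small post-processing sketch, the snippet below reads the text of a plink.vt.mperm file and lists the sets whose empirical p-value falls at or below a chosen cut-off. It assumes the file is whitespace-delimited with a header row naming the SET and EMP1 columns; the helper names are hypothetical, and any values passed to it in examples are invented for illustration.

```python
def read_table(text):
    """Parse a whitespace-delimited table with a header row into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def significant_sets(mperm_text, alpha):
    """Return (set name, EMP1) pairs with EMP1 <= alpha, smallest first."""
    hits = []
    for row in read_table(mperm_text):
        p = float(row["EMP1"])
        if p <= alpha:
            hits.append((row["SET"], p))
    return sorted(hits, key=lambda pair: pair[1])
```

For example, significant_sets(open("plink.vt.mperm").read(), 0.05) would return the sets at or below the 0.05 level, assuming the file layout matches the description above.
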
NOTE: By default, the test is 1-sided, and assumes that a risk allele increases a quantitative score (or increases risk for disease). The command
     --vt-test-low
will reverse this.

HINT: As with other PLINK tests, for stratified samples the --within command (which takes a cluster file) can be used to constrain the permutation (swapping of phenotypes) to occur only between individuals within the same cluster, thereby preserving any between-cluster association with phenotype and/or genotype.
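
The idea of a cluster-constrained permutation can be illustrated as follows. This is a minimal sketch of the concept only, not PLINK's code; the function name and the list-based inputs are assumptions made for the example.

```python
import random
from collections import defaultdict


def permute_within_clusters(phenotypes, clusters, rng):
    """Shuffle phenotypes only among individuals sharing a cluster label."""
    members = defaultdict(list)
    for idx, label in enumerate(clusters):
        members[label].append(idx)
    permuted = list(phenotypes)
    for indices in members.values():
        values = [phenotypes[i] for i in indices]
        rng.shuffle(values)  # swaps never cross a cluster boundary
        for i, v in zip(indices, values):
            permuted[i] = v
    return permuted
```

Each cluster keeps exactly the same collection of phenotype values after shuffling, so differences in phenotype between clusters are the same in every permuted dataset.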

Defining sets

Sets define the groups of SNPs to which the test is applied, which will commonly be genes or groups of genes. As described here and here, any PLINK option for this can be used, e.g.
     --set myset.set 
where myset.set is a file
  GENE1
   snp1
   snp2
  END

  GENE2
   snp2
   snp3
   snp4
  END
or, alternatively,
  --make-set  myset.dat
where each line of myset.dat contains four fields:
  CHR  BP1 BP2  GENE-NAME
This second form can be combined with --make-set-border to specify a kb interval around each gene to be included.

If no sets are specified, then all SNPs in the file will be included in the test; otherwise, the test will be performed separately for each set.
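
If set definitions are held in another program, a --set file in the layout shown above can be generated or read back with a short script. The sketch below assumes the file is a stream of whitespace-separated tokens in which each set is a name, its SNP IDs, and the keyword END; the function names are hypothetical.

```python
def format_set_file(sets):
    """Render {set name: [SNP IDs]} in the NAME / SNPs / END layout."""
    blocks = []
    for name, snps in sets.items():
        blocks.append("\n".join([name, *snps, "END"]))
    return "\n\n".join(blocks) + "\n"


def parse_set_file(text):
    """Read the NAME / SNPs / END layout back into {set name: [SNP IDs]}."""
    sets = {}
    current = None
    for token in text.split():
        if current is None:
            current = token        # first token of a block is the set name
            sets[current] = []
        elif token == "END":
            current = None         # close the current set
        else:
            sets[current].append(token)
    return sets
```

Note that a SNP may appear in more than one set, as snp2 does in the example above.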

Variant weights

Weights can be applied to each variant, for example, to represent the probability that a missense variant has a deleterious impact on protein function.

plink --bfile mydata --vt-test --mperm 10000 --weights myweights.txt

We assume weights are coded on a scale of 0 to 1, with a lower number meaning a lower weight. Weights outside this range will be censored at these values. If the --weights command is included, then variants in the dataset but not listed in the weight file will be assigned a default weight of 0.

The format of the weights file is a text file with exactly two entries per line:

   SNP  WEIGHT
If the command
   --ppweights
is used instead of --weights, the behavior is identical except that PLINK will assume these are weights from PolyPhen2 and apply an adjustment to the weight for common SNPs (above 1% MAF), by setting it equal to 0.5 if the weight is less than 1.0.
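
The weighting rules described in this section can be summarised in a short sketch. It is an illustration of the stated rules rather than PLINK's code: the function names are hypothetical, and applying the PolyPhen2 adjustment to unlisted SNPs (which carry the default weight of 0) is an assumption, since the text above does not say how those are treated.

```python
def censor(weight):
    """Clamp a weight to the 0-1 range."""
    return min(1.0, max(0.0, weight))


def effective_weights(snps, weight_file_text, mafs=None, polyphen=False):
    """Weights as read from a two-column SNP/WEIGHT file.

    mafs (SNP -> minor allele frequency) is only needed when polyphen=True.
    """
    listed = {}
    for line in weight_file_text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            listed[fields[0]] = censor(float(fields[1]))
    result = {}
    for snp in snps:
        w = listed.get(snp, 0.0)  # unlisted variants default to 0
        # Assumed reading of --ppweights: common SNPs (MAF above 1%)
        # with a weight below 1.0 are set to 0.5.
        if polyphen and mafs[snp] > 0.01 and w < 1.0:
            w = 0.5
        result[snp] = w
    return result
```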

Attributes

Optionally, attributes can be used to filter the dataset (described here), or to append information to the results for the variable-threshold test. Specifically:
  --attrib {file}                   Include attrib info as column in plink.vt.var
  --filter-attrib {file} {attrib}   Pre-filter on attributes, e.g. only missense SNPs
Attribute files should have the format of one SNP per line, starting with the SNP ID and then a list of whitespace-delimited attributes:
     SNP  attrib1  attrib2 ...
For example, with the file mysnps.txt
   rs00001   missense
   rs00002   missense
   rs00003   synon
   rs00005   nonsense 
the command

plink --bfile mydata --vt-test --mperm 10000 --attrib mysnps.txt

would append to the plink.vt.var file this information where present
     SET      SNP      WGT      CNT            F   ATTRIB
     ALL  rs00001        1        2    0.0002877   missense
     ALL  rs00002        1        1    0.0001438   missense
     ALL  rs00003        1        1    0.0001438   synon
     ALL  rs00004        1        1    0.0001438
     ALL  rs00005        1        3    0.0004315   nonsense
     ALL  rs00006        1       10     0.001438
     ...
whereas adding the option
     --filter-attrib mysnps.txt missense
would only include SNPs with the missense attribute in the analysis.

Attributes are defined by the user rather than being hard-coded into PLINK, and so could represent any type of meta-information or be coded differently (e.g. the labels Mis, M, etc. could be used instead of, or as well as, missense).
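
To check which SNPs a given attribute filter would select before running the test, an attribute file in the format above can be read with a few lines of Python. This is a convenience sketch with hypothetical function names; it only mirrors the file layout described in this section and does not reproduce PLINK's own filtering code.

```python
def parse_attributes(text):
    """Map each SNP ID to its list of whitespace-delimited attributes."""
    attribs = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            attribs[fields[0]] = fields[1:]
    return attribs


def snps_with_attribute(attribs, wanted):
    """SNP IDs carrying the requested attribute label, in file order."""
    return [snp for snp, labels in attribs.items() if wanted in labels]
```

With the mysnps.txt example above, asking for missense would select rs00001 and rs00002.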

An example of alternative input formats

Here we illustrate an alternative input format for the rare-variant tests (which also applies to any PLINK analysis): the case in which one has only a list of minor/non-reference allele counts, where each line of a file contains the fields:
   FID    IID    SNP-ID     ALLELE-COUNT 
Here, one could use the following:

./plink --lfile data1 --reference data1.ref --allele-count { etc ... }

where we expect a FAM, MAP, LGEN and reference file as described here.

Other rare-variant burden tests

Two other tests that are similar to the variable-threshold test are also available in this same framework. If, instead of --vt-test, you specify
     --fw-test
then PLINK will perform the frequency-weighted Madsen/Browning test. If you instead specify
     --rv-test 0.01
then PLINK will perform a fixed-threshold rare-variant test: in this example, all minor alleles with sample frequency below 1% would be included.

The --weights (or --ppweights) command can be combined with both of these tests. For the frequency-weighted test, the weight for each variant becomes the product of the frequency-weight and the user-specified weight. The --vt-test-low command also applies to these two tests.
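
The following sketch illustrates the two ideas in this section: a frequency-based weight multiplied by a user weight, and a fixed frequency cut-off. The frequency-weight formula follows the weighting published by Madsen and Browning (frequency estimated from unaffected individuals with a pseudo-count); whether PLINK uses exactly this estimator is an assumption, and all function names are hypothetical.

```python
import math


def madsen_browning_weight(minor_alleles_controls, n_controls, n_total):
    """Frequency weight 1 / sqrt(n * q * (1 - q)), with q estimated in controls."""
    q = (minor_alleles_controls + 1) / (2 * n_controls + 2)
    return 1.0 / math.sqrt(n_total * q * (1.0 - q))


def combined_weight(frequency_weight, user_weight):
    """Per-variant weight when --weights is combined with the frequency-weighted test."""
    return frequency_weight * user_weight


def below_threshold(mafs, threshold):
    """SNPs whose sample minor allele frequency is below a fixed threshold."""
    return [snp for snp, f in mafs.items() if f < threshold]
```

Under this weighting, rarer variants contribute more to the burden score than common ones, whereas the fixed-threshold test treats every variant below the cut-off equally (apart from any user-specified weights).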

Currently, these commands only produce an empirical p-value output file, named either plink.fw.mperm or plink.rv.mperm.