This page describes the utility features in PLINK for applying generic annotations to various types of SNP-centric files. Automatically annotating SNPs with whether they are functional (or tag functional variants), and with the genes they are in or near, requires only downloading two files (the attribute file and the gene list described below, both available from the resources page) and running a single --annotate command.

Basic usage

The basic command to annotate a result file is

plink --annotate myfile.assoc attrib=snp129.attrib.gz ranges=glist.txt

which creates a file

     plink.annot

which contains all the fields in myfile.assoc but with the annotation data appended in the rightmost column.

Note that the --annotate command takes only a single fixed argument: the name of the file to be annotated. All other keywords that follow are options. Note how they are listed differently in the LOG file:

        --annotate tmp.1
          attrib=snp129.attrib.gz
          ranges=glist.txt

More details about options are given elsewhere in the PLINK documentation.

At least one attrib or ranges keyword/file pair must be specified.

For example, consider a file myfile.assoc that contains the following information in the first few rows:

  CHR         SNP        BP         P
    1   rs3094315    792429    0.1521
    1   rs6672353    817376    0.3649
    1   rs4040617    819185    0.2315
    1   rs4075116   1043552    0.3453
    1   rs9442385   1137258    0.3968

Second, we have a list of attributes in the file snp129.attrib.gz, which is a compressed file that (when uncompressed) is in the format:

    SNP-identifier  attribute1 attribute2 ...

where the attributes are any user-defined text fields. In this example, the attributes relate to the functional status of each SNP, e.g. nonsense, missense, frameshift, etc. In this particular case, we use upper-case to indicate a SNP is actually coding; lower-case indicates that the SNP is in strong linkage disequilibrium with a coding SNP. Also, each attribute begins with an equals sign, to make a clear distinction between an attribute and any gene names (see below). These conventions are not specified in any way by the --annotate command itself, however.

     rs12568050 =MISSENSE
     rs443143 =missense
     rs4758895 =missense
     rs6497638 =nonsense =missense
     rs2593389 =missense
     rs4446721 =frameshift
     ...

If the attribute file name ends in .gz, and ZLIB support is available to PLINK, the file is automatically decompressed on the fly as it is read. If the attribute file name does not end in .gz, it is assumed to be a standard plain-text file.
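For readers who want to inspect an attribute file outside PLINK, the format above can be read with a few lines of Python. This is only a sketch under stated assumptions, not PLINK's own reader: the function name read_attributes is hypothetical, fields are assumed to be whitespace-delimited, and the .gz suffix rule simply mirrors the behaviour described above.

```python
import gzip


def read_attributes(path):
    """Return {snp_id: [attribute, ...]} from an attribute file.

    Hypothetical helper for illustration; not part of PLINK.
    """
    # Assumption: a name ending in .gz means gzip-compressed, otherwise plain text.
    opener = gzip.open if path.endswith(".gz") else open
    attributes = {}
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            snp_id, attrs = fields[0], fields[1:]
            # A SNP listed on several lines accumulates all of its attributes.
            attributes.setdefault(snp_id, []).extend(attrs)
    return attributes
```

For the example lines shown above, the entry for rs6497638 would hold both =nonsense and =missense.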

NOTE The snp129.attrib.gz file discussed here is available from the resources page.

Third, we have a list of gene names and co-ordinates. This is the file specified after the ranges keyword, assumed to be in the standard range format for PLINK: chromosome, start position, stop position, name (and optional group name in the fifth field), e.g.

     19  63549983   63556677  A1BG
     10  52236330   52315441  A1CF
     8   43266741   43337485  A26A1
     15  19305252   19336667  A26B1
     21  13904368   13935777  A26B3
     2   131692393 131738886  A26C1A
     ...

In this example, the ranges correspond to genes, although they could in practice correspond to any type of intervals. That is, the --annotate function can be used with any generic set of ranges, as defined by the user (e.g. with regions corresponding to linkage peaks, regions under positive selection, etc).
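The idea of matching a SNP position against a range list can be sketched in Python as follows. This illustrates the concept only and is not PLINK's implementation: the function names are hypothetical, range endpoints are assumed to be inclusive, the border is passed explicitly in kb, and the distance returned is unsigned (the +/- sign that PLINK reports is not reproduced here).

```python
def read_ranges(lines):
    """Parse range lines: chromosome, start, stop, name, optional group."""
    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        chrom, name = fields[0], fields[3]
        start, stop = int(fields[1]), int(fields[2])
        group = fields[4] if len(fields) > 4 else None
        ranges.append((chrom, start, stop, name, group))
    return ranges


def ranges_near(chrom, bp, ranges, border_kb):
    """Return [(name, distance_kb)] for ranges on chrom within border_kb of bp."""
    hits = []
    for r_chrom, start, stop, name, _group in ranges:
        if r_chrom != str(chrom):
            continue
        if start <= bp <= stop:
            gap = 0  # position lies inside the range (assumed inclusive)
        else:
            gap = min(abs(bp - start), abs(bp - stop))
        if gap <= border_kb * 1000:
            hits.append((name, gap / 1000))
    return hits
```

The same two functions work for any interval list in this format, such as linkage peaks, since nothing in them is specific to genes.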

NOTE The glist.txt file discussed here is also available from the resources page.

Given these three files, the --annotate command will append the attribute and range information, where appropriate, to the input file, e.g. plink.annot might begin:

  CHR         SNP        BP         P   ANNOT
    1   rs3094315    792429    0.1521   =missense
    1   rs6672353    817376    0.3649   .
    1   rs4040617    819185    0.2315   =missense
    1   rs4075116   1043552    0.3453   C1orf159(+1.953kb)
    1   rs9442385   1137258    0.3968   TNFRSF4(0)|TNFRSF18(+5.306kb)|SDF4(-4.892kb)
    ...

This indicates, for example, that rs3094315 is in strong LD with a missense SNP, and that rs9442385 is in the gene TNFRSF4 and about 5kb away from two other genes, TNFRSF18 and SDF4.
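When post-processing plink.annot in a script, the ANNOT field can be split back into its parts. The following Python sketch assumes the layout seen in the example output above: entries separated by |, range entries written as NAME(0) or NAME(+1.953kb), and anything else treated as an attribute. The function name parse_annot is hypothetical, and the assumption that attributes and ranges share the | separator is not confirmed by this page.

```python
import re

# A range entry looks like NAME(0) or NAME(+1.953kb) in the example output.
_RANGE_ENTRY = re.compile(r"^(?P<name>.+)\((?P<dist>[+-]?\d+(?:\.\d+)?)(?:kb)?\)$")


def parse_annot(field):
    """Split one ANNOT value into (attributes, ranges).

    ranges is a list of (name, signed_distance_kb) tuples.
    """
    attributes, ranges = [], []
    if field in (".", "NA"):
        return attributes, ranges  # no annotation for this SNP
    for entry in field.split("|"):
        match = _RANGE_ENTRY.match(entry)
        if match:
            ranges.append((match.group("name"), float(match.group("dist"))))
        else:
            attributes.append(entry)
    return attributes, ranges
```

Applied to the last row of the example, this yields three range entries and no attributes.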

NOTE It is not required for the input file to have CHR and BP fields if ranges are not applied (i.e. attributes are assigned to SNPs based solely on the unique identifier/rs-number, not genomic location). Similarly, the P field is not required, unless --pfilter has been specified.

Misc. options

There are several options that can modify the behavior of --annotate.

Filters

To filter on regions (so the plink.annot file only contains SNPs in those regions) use

     filter=myreg.txt

where myreg.txt is in the same format as the gene/range list above.

To only include a specific set of SNPs from the input file, use

     snps=mysnps.txt

where mysnps.txt is just a list of SNP IDs.

To apply only a subset of the ranges for annotation, use the option

     subset=myfile.txt

where myfile.txt is a list of range names (i.e. corresponding to the file specified by ranges=).

To output only SNPs that have at least some annotation, use the option

     prune

To filter on p-value, if the input file has a P field in its header, use the separate command (a command rather than an option, hence the -- prefix):

     --pfilter 0.05

Output options

To alter the format of the output file, so that a series of 0 and 1 variables are output for each attribute and/or range, use the option

     block

For example, instead of

    SNP  CHR    BP   ANNOT 
  rs001    1  1111   .
  rs002    1  2222   =NONSENSE
  rs003    1  3333   =nonsense
  rs004    1  4444   .

the plink.annot file would contain

    SNP  CHR    BP   =NONSENSE  =nonsense
  rs001    1  1111   0          0
  rs002    1  2222   1          0
  rs003    1  3333   0          1
  rs004    1  4444   0          0
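The transformation from the ANNOT column to indicator columns can be sketched in Python. This is an illustration of the block layout shown above, not PLINK's code: the function name to_block is hypothetical, columns are assumed to appear in order of first occurrence, multiple entries are assumed to be separated by |, and only attribute-style entries (as in this example) are handled.

```python
def to_block(rows, missing=(".", "NA")):
    """Convert (snp, chrom, bp, annot) rows into a 0/1 indicator table.

    Returns (header, table), where table is a list of row lists.
    """
    labels = []  # distinct annotation labels, in order of first appearance
    parsed = []
    for snp, chrom, bp, annot in rows:
        entries = [] if annot in missing else annot.split("|")
        parsed.append((snp, chrom, bp, entries))
        for entry in entries:
            if entry not in labels:
                labels.append(entry)
    header = ["SNP", "CHR", "BP"] + labels
    table = [
        [snp, chrom, bp] + [1 if label in entries else 0 for label in labels]
        for snp, chrom, bp, entries in parsed
    ]
    return header, table
```

Note that =NONSENSE and =nonsense end up as separate columns, since the labels are compared as case-sensitive text.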

To place a NA symbol instead of . in the ANNOT field when no annotation is found, add the option

     NA

This can make files easier to read into statistical packages, for example.

To specify a particular border for genes/ranges (i.e. such that genes/ranges within X kb of the SNP are reported as near that SNP), use the --border command. For example, for a 20kb border:

     --border 20

To only list the gene/range name, and not the kb distance following it, add the option:

     minimal

To generate an additional output field that contains the kb distance to the nearest gene, and a field indicating whether the nearest gene is upstream or downstream (+, -), add the option:

     distance

This document last modified Wednesday, 25-Jan-2017 11:39:26 EST