PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

Result annotation

 

This page describes the utility features in PLINK for applying generic annotations to various types of SNP-centric files. To automatically annotate SNPs with whether they are functional or tag functional variants, and with the genes they are in or near, you only need to download two files (the SNP attribute file and the gene list used in the examples below, both available from the resources page) and run a single --annotate command, as described below.

Basic usage

The basic command to annotate a result file is
plink --annotate myfile.assoc attrib=snp129.attrib.gz ranges=glist.txt

which creates a file
     plink.annot
which contains all the fields in myfile.assoc but with the annotation data appended in the rightmost column.

Note that the --annotate command takes only a single fixed argument: the name of the file to be annotated. All other keywords that follow are options. Note how they are listed differently in the LOG file:
        --annotate tmp.1
          attrib=snp129.attrib.gz
          ranges=glist.txt
See this link for more details about options.

An attrib and/or a ranges keyword/file pair must be specified.
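
As a small illustration of the command structure described above (one fixed argument, followed by keyword options, with at least one of attrib= or ranges= present), the following Python sketch assembles the argument list for such a call. The helper name build_annotate_args is hypothetical and not part of PLINK; the sketch only builds the command line and does not run PLINK.

```python
def build_annotate_args(input_file, attrib=None, ranges=None, options=()):
    """Return the argument list for a `plink --annotate` call.

    Hypothetical helper: it only mirrors the rules stated in the text
    (a single fixed argument, then keyword options).
    """
    # At least one of attrib= / ranges= must be given.
    if attrib is None and ranges is None:
        raise ValueError("specify attrib=, ranges=, or both")

    # The file to be annotated is the only fixed argument.
    args = ["plink", "--annotate", input_file]
    if attrib is not None:
        args.append(f"attrib={attrib}")
    if ranges is not None:
        args.append(f"ranges={ranges}")

    # Any further bare options (e.g. "prune", "NA", "minimal") follow.
    args.extend(options)
    return args


print(" ".join(build_annotate_args(
    "myfile.assoc", attrib="snp129.attrib.gz", ranges="glist.txt")))
```

The resulting list could be passed to subprocess.run if PLINK is installed and on the PATH.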

For example, consider a file myfile.assoc that contains the following information in the first few rows:
  CHR         SNP        BP         P
    1   rs3094315    792429    0.1521
    1   rs6672353    817376    0.3649
    1   rs4040617    819185    0.2315
    1   rs4075116   1043552    0.3453
    1   rs9442385   1137258    0.3968
Second, we have a list of attributes in the file snp129.attrib.gz, which is a compressed file that (when uncompressed) is in the format:
    SNP-identifier  attribute1 attribute2 ...
where the attributes are any user-defined text fields. In this example, the attributes relate to the functional status of each SNP, e.g. nonsense, missense, frameshift, etc. In this particular case, we use upper-case to indicate a SNP is actually coding; lower-case indicates that the SNP is in strong linkage disequilibrium with a coding SNP. Also, each attribute begins with an equals sign, to make a clear distinction between an attribute and any gene names (see below). These conventions are not specified in any way by the --annotate command itself, however.

     rs12568050 =MISSENSE
     rs443143 =missense
     rs4758895 =missense
     rs6497638 =nonsense =missense
     rs2593389 =missense
     rs4446721 =frameshift
     ...
If the attribute file ends in .gz, and ZLIB support is available to PLINK, then it will be automatically read and decompressed on the fly. If the attribute file does not end in .gz, it is assumed to be a standard plain-text file.
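
To inspect an attribute file outside PLINK, a reader along the following lines may be useful. This is a minimal Python sketch, not PLINK's own code: the function name read_attributes is hypothetical, and it assumes whitespace-separated fields with the SNP identifier first, as in the format shown above. It applies the same .gz-versus-plain-text rule based on the file name.

```python
import gzip


def read_attributes(path):
    """Map each SNP identifier to its list of attribute strings.

    Assumes one SNP per line: identifier first, then any number of
    whitespace-separated attributes.
    """
    # Files ending in .gz are decompressed on the fly; anything else
    # is treated as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    attributes = {}
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            snp, attrs = fields[0], fields[1:]
            # Merge attributes if a SNP appears on more than one line
            # (an assumption of this sketch).
            attributes.setdefault(snp, []).extend(attrs)
    return attributes
```

For example, read_attributes("snp129.attrib.gz") would return a dictionary in which the line for rs6497638 above appears as the list ["=nonsense", "=missense"].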

NOTE The snp129.attrib.gz file discussed here is available from the resources page.

Third, we have a list of gene names and co-ordinates. This is the file specified after the ranges keyword, assumed to be in the standard range format for PLINK: chromosome, start position, stop position, name (and optional group name in the fifth field), e.g.
     19  63549983   63556677  A1BG
     10  52236330   52315441  A1CF
     8   43266741   43337485  A26A1
     15  19305252   19336667  A26B1
     21  13904368   13935777  A26B3
     2   131692393 131738886  A26C1A
     ...
In this example, the ranges correspond to genes, although they could in practice correspond to any type of intervals. That is, the --annotate function can be used with any generic set of ranges, as defined by the user (e.g. with regions corresponding to linkage peaks, regions under positive selection, etc).
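
The range format can be illustrated with a short Python sketch that parses one line into its fields. The Range type and parse_range_line function are hypothetical names introduced here for illustration; the only assumption about the format is the one stated above (four whitespace-separated fields, with an optional fifth group field).

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    chrom: str
    start: int
    stop: int
    name: str
    group: Optional[str] = None  # optional fifth field


def parse_range_line(line):
    """Parse one line of a range file into a Range record."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    group = fields[4] if len(fields) > 4 else None
    return Range(fields[0], int(fields[1]), int(fields[2]), fields[3], group)


print(parse_range_line("19  63549983   63556677  A1BG"))
```

The chromosome is kept as a string so that non-numeric codes are not rejected by the parser.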

NOTE The glist.txt file discussed here is also available from the resources page.

Given these three files, the --annotate command will append the attribute and range information, where appropriate, to the input file, e.g. plink.annot might begin:
  CHR         SNP        BP         P   ANNOT
    1   rs3094315    792429    0.1521   =missense
    1   rs6672353    817376    0.3649   .
    1   rs4040617    819185    0.2315   =missense
    1   rs4075116   1043552    0.3453   C1orf159(+1.953kb)
    1   rs9442385   1137258    0.3968   TNFRSF4(0)|TNFRSF18(+5.306kb)|SDF4(-4.892kb)
    ...
for example, indicating that rs3094315 is in strong LD with a missense SNP, and that rs9442385 is in the gene TNFRSF4, about 5kb away from two other genes, TNFRSF18 and SDF4.
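
The range part of the ANNOT field can be mimicked with a short Python sketch. This is an illustration of the idea only, not PLINK's implementation: the function name range_labels, the gene names and the coordinates are all hypothetical, and the sign convention (+ when the SNP lies beyond the end of a range, - when it lies before the start) is an assumption of this sketch rather than something stated on this page.

```python
def range_labels(chrom, bp, ranges, border_kb):
    """Build an ANNOT-style string for one SNP.

    ranges: iterable of (chrom, start, stop, name) tuples.
    border_kb: how far (in kb) outside a range a SNP may lie and
               still be reported as near that range.
    """
    border_bp = border_kb * 1000
    labels = []
    for r_chrom, start, stop, name in ranges:
        if r_chrom != chrom:
            continue
        if start <= bp <= stop:
            # SNP lies inside the range.
            labels.append(f"{name}(0)")
        elif stop < bp <= stop + border_bp:
            # SNP lies past the end of the range (assumed '+').
            labels.append(f"{name}(+{(bp - stop) / 1000:.3f}kb)")
        elif start - border_bp <= bp < start:
            # SNP lies before the start of the range (assumed '-').
            labels.append(f"{name}(-{(start - bp) / 1000:.3f}kb)")
    # '.' marks a SNP with no annotation, as in the output above.
    return "|".join(labels) if labels else "."


# Hypothetical ranges on chromosome 1.
toy_ranges = [
    ("1", 1000000, 1010000, "GENE_A"),
    ("1", 1015000, 1020000, "GENE_B"),
    ("1", 980000, 995000, "GENE_C"),
]
print(range_labels("1", 1005000, toy_ranges, border_kb=20))
# prints: GENE_A(0)|GENE_B(-10.000kb)|GENE_C(+10.000kb)
```

Labels are emitted in the order the ranges are listed; the order PLINK itself uses is not specified here.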

NOTE The input file does not need CHR and BP fields if ranges are not applied (i.e. attributes are assigned to SNPs based solely on the unique identifier/rs-number, not genomic location). Similarly, the P field is required only if --pfilter has been specified.

Misc. options

There are several options that can modify the behavior of --annotate.

Filters

To filter on regions (so the plink.annot file only contains SNPs in those regions) use
     filter=myreg.txt
where myreg.txt is in the same format as the gene/range list above.

To only include a specific set of SNPs from the input file, use
     snps=mysnps.txt
where mysnps.txt is just a list of SNP IDs.

To apply only a subset of the ranges for annotation, use the option
     subset=myfile.txt
where myfile.txt is a list of range names (i.e. corresponding to the file specified by ranges=).

To output only SNPs that have at least some annotation, use the option
     prune

To filter on p-value, if the input file has a P field in its header, use the separate --pfilter command (a command rather than an option, so it takes the -- prefix):
     --pfilter 0.05
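
The combined effect of prune and --pfilter can be pictured as a per-row test. The Python sketch below is illustrative only: keep_row is a hypothetical function, rows are assumed to be dictionaries keyed by the header names, and two details are assumptions of the sketch rather than documented PLINK behavior, namely that the threshold comparison is inclusive and that rows with a non-numeric P value are dropped when a threshold is set.

```python
def keep_row(row, pfilter=None, prune=False):
    """Decide whether an annotated row would be written out.

    row: dict with at least "P" and "ANNOT" keys (as strings).
    """
    if pfilter is not None:
        try:
            p_value = float(row["P"])
        except ValueError:
            return False  # assumption: non-numeric P fails the filter
        if p_value > pfilter:  # assumption: threshold is inclusive
            return False
    if prune and row["ANNOT"] == ".":
        return False  # prune drops SNPs with no annotation
    return True


rows = [
    {"SNP": "rs3094315", "P": "0.1521", "ANNOT": "=missense"},
    {"SNP": "rs6672353", "P": "0.3649", "ANNOT": "."},
    {"SNP": "rs4040617", "P": "0.2315", "ANNOT": "=missense"},
]
kept = [r["SNP"] for r in rows if keep_row(r, pfilter=0.3, prune=True)]
print(kept)  # prints: ['rs3094315', 'rs4040617']
```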

Output options

To alter the format of the output file, so that a series of 0 and 1 variables are output for each attribute and/or range, use the option
     block
For example, instead of
    SNP  CHR    BP   ANNOT 
  rs001    1  1111   .
  rs002    1  2222   =NONSENSE
  rs003    1  3333   =nonsense
  rs004    1  4444   .
the plink.annot file would contain
    SNP  CHR    BP   =NONSENSE  =nonsense
  rs001    1  1111   0          0
  rs002    1  2222   1          0
  rs003    1  3333   0          1
  rs004    1  4444   0          0
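
The same transformation can be reproduced on an existing ANNOT column, for instance when a file was annotated without the block option. The Python sketch below is one way to do this; to_block is a hypothetical name, and the sketch assumes that multiple annotations in one field are separated by | and that columns appear in order of first occurrence, neither of which is guaranteed to match PLINK's own output.

```python
def to_block(rows, sep="|"):
    """Convert (snp, annot) pairs into 0/1 indicator columns.

    Returns (columns, table), where each table row is
    [snp, indicator_1, indicator_2, ...].
    """
    # Collect distinct annotations in order of first appearance.
    columns = []
    for _, annot in rows:
        for item in annot.split(sep):
            if item != "." and item not in columns:
                columns.append(item)

    # One indicator per annotation; '.' rows become all zeros.
    table = []
    for snp, annot in rows:
        present = set(annot.split(sep))
        table.append([snp] + [1 if c in present else 0 for c in columns])
    return columns, table


columns, table = to_block([
    ("rs001", "."),
    ("rs002", "=NONSENSE"),
    ("rs003", "=nonsense"),
    ("rs004", "."),
])
print(columns)
for line in table:
    print(line)
```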
To place an NA symbol instead of . in the ANNOT field when no annotation is found, add the option
     NA
This can make files easier to read into statistical packages, for example.

To specify a particular border for genes/ranges (i.e. such that genes/ranges within X kb of the SNP are reported as near that SNP), use the command, e.g. for a 20kb border,
     --border 20
To only list the gene/range name, and not the kb distance following it, add the option:
     minimal
To generate an additional output field that contains the kb distance to the nearest gene, and a field indicating whether the nearest gene is upstream or downstream (+, -), add the option:
     distance

 

This document last modified Wednesday, 25-Jan-2017 11:39:26 EST