Result annotation
This page describes the utility features in PLINK for applying generic
annotations to various types of SNP-centric files. To automatically
annotate SNPs with whether they are functional or tag functional
variants, and with the genes they are in or near, you need only
download two files (an attribute file and a gene list, both available
from the resources page) and run a single --annotate command as
described below.
Basic usage
The basic command to annotate a result file is
plink --annotate myfile.assoc attrib=snp129.attrib.gz ranges=glist.txt
which creates a file
plink.annot
which contains all the fields in myfile.assoc but with the annotation data appended in the rightmost column.
Note that the --annotate command takes only a single fixed
argument: the name of the file to be annotated. All other keywords
that follow are options. Note how they are listed differently
in the LOG file:
--annotate tmp.1
attrib=snp129.attrib.gz
ranges=glist.txt
See this link for more
details about options.
An attrib and/or a ranges keyword/file pair must be specified.
For example, consider a file myfile.assoc that contains the following
information in the first few rows:
CHR SNP BP P
1 rs3094315 792429 0.1521
1 rs6672353 817376 0.3649
1 rs4040617 819185 0.2315
1 rs4075116 1043552 0.3453
1 rs9442385 1137258 0.3968
Second, we have a list of attributes in the file
snp129.attrib.gz, which is a compressed file that (when
uncompressed) is in the format:
SNP-identifier attribute1 attribute2 ...
where the attributes are any user-defined text fields. In this
example, the attributes relate to the functional status of each SNP,
e.g. nonsense, missense, frameshift, etc. In this particular case, we
use upper-case to indicate a SNP is actually coding; lower-case
indicates that the SNP is in strong linkage disequilibrium with a
coding SNP. Also, each attribute begins with an equals sign, to make a
clear distinction between an attribute and any gene names (see
below). These conventions are not specified in any way by the
--annotate command itself, however.
rs12568050 =MISSENSE
rs443143 =missense
rs4758895 =missense
rs6497638 =nonsense =missense
rs2593389 =missense
rs4446721 =frameshift
...
If the attribute file name ends in .gz, and ZLIB support is
available to PLINK, then it will be automatically read and
decompressed on the fly. If the attribute file name does not end in
.gz, it is assumed to be a standard plain-text file.
NOTE The snp129.attrib.gz file discussed
here is available from the resources
page.
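To make the attribute file format concrete, here is a minimal Python sketch that reads such a file into a dictionary, decompressing when the name ends in .gz. It is an illustration only, not PLINK's own code; the function names open_maybe_gzip and read_attributes are hypothetical, and merging repeated SNP lines is an assumption of this sketch.

```python
import gzip
from collections import defaultdict


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_attributes(path):
    """Return {snp_id: [attribute, ...]} from a whitespace-delimited file.

    Each line is expected to look like: SNP-identifier attribute1 attribute2 ...
    Assumption of this sketch: if a SNP appears on several lines, its
    attributes are concatenated in file order.
    """
    attribs = defaultdict(list)
    with open_maybe_gzip(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            attribs[fields[0]].extend(fields[1:])
    return dict(attribs)
```

For example, read_attributes("snp129.attrib.gz") would map rs6497638 to the list ["=nonsense", "=missense"] for the sample lines shown above.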
Third, we have a list of gene names and co-ordinates. This is the file
specified after the ranges keyword, assumed to be in the
standard range format for PLINK: chromosome, start position,
stop position, name (and optional group name in the fifth field), e.g.
19 63549983 63556677 A1BG
10 52236330 52315441 A1CF
8 43266741 43337485 A26A1
15 19305252 19336667 A26B1
21 13904368 13935777 A26B3
2 131692393 131738886 A26C1A
...
In this example, the ranges correspond to genes, although they could
in practice correspond to any type of intervals. That is, the
--annotate function can be used with any generic set of
ranges, as defined by the user (e.g. with regions corresponding to
linkage peaks, regions under positive selection, etc).
NOTE The glist.txt file discussed here is
also available from the resources page.
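The range format above can be sketched in Python as follows. This is a rough illustration of parsing the file and finding ranges that contain, or lie within a given number of kb of, a SNP position; it is not PLINK's implementation. The names Range, parse_ranges and nearby_ranges are hypothetical, and treating range ends as inclusive, reporting unsigned distances, and defaulting to no border are all assumptions of the sketch.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    chrom: str
    start: int
    stop: int
    name: str
    group: Optional[str] = None  # optional fifth field


def parse_ranges(lines):
    """Parse lines of: chromosome start stop name [group]."""
    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        group = fields[4] if len(fields) > 4 else None
        ranges.append(
            Range(fields[0], int(fields[1]), int(fields[2]), fields[3], group)
        )
    return ranges


def nearby_ranges(ranges, chrom, bp, border_kb=0.0):
    """Return (name, distance_kb) for ranges within border_kb of position bp.

    The distance is 0 when bp lies inside the range (ends treated as
    inclusive, an assumption) and the unsigned gap in kb otherwise.
    """
    border_bp = border_kb * 1000
    hits = []
    for r in ranges:
        if r.chrom != chrom:
            continue
        if r.start <= bp <= r.stop:
            gap = 0
        elif bp < r.start:
            gap = r.start - bp
        else:
            gap = bp - r.stop
        if gap <= border_bp:
            hits.append((r.name, gap / 1000))
    return hits
```

Because nothing in the sketch is specific to genes, the same lookup applies to any user-defined set of intervals.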
Given these three files, the --annotate command will append
the attribute and range information, where appropriate, to the input
file, e.g. plink.annot might begin:
CHR SNP BP P ANNOT
1 rs3094315 792429 0.1521 =missense
1 rs6672353 817376 0.3649 .
1 rs4040617 819185 0.2315 =missense
1 rs4075116 1043552 0.3453 C1orf159(+1.953kb)
1 rs9442385 1137258 0.3968 TNFRSF4(0)|TNFRSF18(+5.306kb)|SDF4(-4.892kb)
...
for example, indicating that rs3094315 is in strong LD with a
missense SNP, and that rs9442385 is in the gene
TNFRSF4, about 5kb away from two other genes,
TNFRSF18 and SDF4.
NOTE It is not required for the input file to have CHR
and BP fields if ranges are not applied (i.e. attributes are assigned to
SNPs based solely on the unique identifier/rs-number, not genomic location). Similarly,
the P field is not required, unless --pfilter has been specified.
Misc. options
There are several options that can modify the behavior of --annotate.
Filters
To filter on regions (so the plink.annot file only contains SNPs in those regions) use
filter=myreg.txt
where myreg.txt is in the same format as the gene/range list above.
To only include a specific set of SNPs from the input file, use
snps=mysnps.txt
where mysnps.txt is just a list of SNP IDs.
To apply only a subset of the ranges for annotation, use the option
subset=myfile.txt
where myfile.txt is a list of range names (i.e. corresponding to the file specified by ranges=).
To output only SNPs that have at least some annotation, use the option
prune
To filter based on p-value, the input file must contain a P field
(identified by the header). Use the separate command shown below;
because it is a command rather than an option to --annotate, it is
prefixed with --:
--pfilter 0.05
Output options
To alter the format of the output file, so that a series of 0
and 1 variables are output for each attribute and/or range,
use the option
block
For example, instead of
SNP CHR BP ANNOT
rs001 1 1111 .
rs002 1 2222 =NONSENSE
rs003 1 3333 =nonsense
rs004 1 4444 .
the plink.annot file would contain
SNP CHR BP =NONSENSE =nonsense
rs001 1 1111 0 0
rs002 1 2222 1 0
rs003 1 3333 0 1
rs004 1 4444 0 0
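The transformation performed by the block option can be illustrated with a short Python sketch that turns an ANNOT column into one 0/1 column per distinct annotation. This is only a sketch of the idea, not PLINK's code: the function name annot_to_block is hypothetical, and splitting a multi-valued ANNOT cell on "|" and ordering the new columns by first appearance are assumptions.

```python
def annot_to_block(header, rows, missing=(".", "NA"), sep="|"):
    """Replace the ANNOT column with one 0/1 column per distinct annotation.

    header: list of column names, including "ANNOT".
    rows:   list of rows, each a list of strings aligned with header.
    Assumptions of this sketch: multiple annotations in one cell are
    separated by sep, and new columns appear in order of first occurrence.
    """
    annot_idx = header.index("ANNOT")
    labels = []   # distinct annotations, in order of first appearance
    parsed = []   # (row without ANNOT, set of annotations for that row)
    for row in rows:
        cell = row[annot_idx]
        items = [] if cell in missing else cell.split(sep)
        for item in items:
            if item not in labels:
                labels.append(item)
        parsed.append((row[:annot_idx] + row[annot_idx + 1:], set(items)))

    new_header = header[:annot_idx] + header[annot_idx + 1:] + labels
    new_rows = [
        base + ["1" if label in items else "0" for label in labels]
        for base, items in parsed
    ]
    return new_header, new_rows


header = ["SNP", "CHR", "BP", "ANNOT"]
rows = [
    ["rs001", "1", "1111", "."],
    ["rs002", "1", "2222", "=NONSENSE"],
    ["rs003", "1", "3333", "=nonsense"],
    ["rs004", "1", "4444", "."],
]
block_header, block_rows = annot_to_block(header, rows)
```

Applied to the four example rows, block_header and block_rows reproduce the indicator table shown above.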
To place an NA symbol instead of . in the ANNOT field when no annotation is found, add the option
NA
This can make the file easier to read into statistical packages, for example.
To specify a particular border for genes/ranges (i.e. such that genes/ranges within X kb of the SNP
are reported as near that SNP), use the command, e.g. for a 20kb border,
--border 20
To only list the gene/range name, and not the kb distance following it, add the option:
minimal
To generate an additional output field that contains the kb distance
to the nearest gene, and a field indicating whether the nearest gene
is upstream or downstream (+, -), add the option:
distance