1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Meta-analysis
21. Annotation
22. LD-based results clumping
23. Gene-based report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. R-plugins
28. Annotation web-lookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flow-chart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK
|
|
Proxy association
This page describes a convenience function designed to provide a quick
representation of a single SNP association, in terms of the
surrounding haplotypic background. Specifically, given a particular
(reference) SNP this approach involves a) finding flanking markers and
haplotypes (proxies) that are in strong linkage disequilibrium with
the reference SNP and, b) testing these proxies for association with
disease, within a haplotype-based framework.
There are three main applications of this utility, which are described
in more detail and with examples in the main text below:
- technical validation of single SNP results ( by looking for
flanking haplotypes involving different markers that also show the
same result )
- refining a single SNP association signal ( is there a stronger
association with a local haplotype? )
- more robust single SNP tests ( by framing single SNP tests
within a haplotypic framework, some degree of control against
non-random genotyping failure can be achieved )
The proxy approach also forms the basis of the imputation methods in
PLINK, described separately. The
methods are identical in fact, the only difference in imputation mode
is the presence of a reference set of individuals that is handled
specially.
The proxy methods use the same basic EM algorithm used by the other
haplotyping methods in PLINK. The only
difference is that the proxy methods put a wrapper around the basic
haplotyping procedure that a) provides some methods to automatically
select proxies to phase given a designated reference SNP, and b)
frames the subsequent tests and summaries in terms of groups of
haplotypes that track the reference SNP.
Proxy association: basic usage
The basic proxy association method for a particular SNP is invoked
with the --proxy-assoc option:
plink --file mydata --proxy-assoc rs6703905
which generates a file
plink.proxy.report
This file contains three main sections, describing the local flanking
SNPs, haplotypes and "proxies" for the reference SNP, and will be
described below in turn. The full output file is shown here:
*** Proxy haplotype association report for rs13232128 ***
SNP MAF GENO KB RSQ OR CHISQ P
rs1389273 0.286 0.00173 -99.2 0.0932 0.916 2.61 0.106
rs10236783 0.253 0.0236 -66.9 0.214 0.875 5.7 0.017
rs17556689 0.328 0.00259 -66.7 0.282 1.1 3.09 0.079
rs17135491 0.153 0.00317 -2.59 0.153 0.934 0.955 0.328
rs13232128 0.494 0.0179 0 * 0.828 14.9 0.000112
rs1826529 0.487 0.00461 9.72 0.674 0.883 6.58 0.0103
....*. FREQ OR CHISQ P
GTTGAG 0.0171 1.02 0.0221 0.882
AGTGAG 0.0166 0.876 0.712 0.399
GGTGAG 0.103 0.91 1.56 0.212
GGCAAG 0.0226 0.97 0.0475 0.827
ATTAAG 0.111 0.853 4.96 0.026
GTTAAG 0.0615 0.877 2.09 0.149
AGTAAG 0.0557 0.881 1.73 0.188
GGTAAG 0.0513 0.949 0.306 0.58
GGCAGG 0.0365 1.39 7.56 0.00596
ATTAAT 0.0233 0.825 1.78 0.183
GGCAGT 0.249 1.05 0.893 0.345
ATTAGT 0.0201 1.31 3.26 0.0711
AGTAGT 0.049 1.08 0.741 0.389
GGTAGT 0.13 1.16 5.17 0.023
Haplotype frequency estimation based on 6938 of 6938 founder chromosomes
Omnibus haplotype test statistic: 23.3, df = 13, p = 0.0377
Of 125 subhaplotypes considered, 8 met proxy criteria
HAP FREQ RSQ OR CHISQ P
..T. G 0.422 0.72 0.843 12.3 0.000449
.G.. T 0.453 0.705 1.15 8.62 0.00333
.G.A T 0.445 0.693 1.14 7.22 0.0072
GG.. T 0.399 0.561 1.14 7.15 0.0075
.... G 0.487 0.674 0.883 6.58 0.0103
...A T 0.505 0.661 1.12 5.56 0.0184
G... T 0.415 0.542 1.11 4.99 0.0255
G..A T 0.408 0.535 1.1 4.19 0.0408
The first section lists the reference SNP (rs13232128) and 5 flanking
SNPs that have been automatically selected as proxies. For each SNP,
the minor allele frequency (MAF), genotyping failure rate (GENO) and
distance to the reference SNP (KB) is given. A measure of single SNP
association is also given for each SNP: odds ratio (OR), chi-squared
statistic (CHISQ) and asymptotic p-value (P).
Importantly, however, these single SNP tests are not
quite the same as from the basic --assoc command, as they are
formed within the haplotypic context of the flanking SNPs. That is,
for example, a single SNP test of the 5th SNP is formed by grouping
the haplotypes as shown below, and testing for a difference in the
frequency of the first group (containing A at the 5th
position) versus the second group (all containing G).
GTTG-A-G
AGTG-A-G
GGTG-A-G
GGCA-A-G
ATTA-A-G
GTTA-A-G
AGTA-A-G
GGTA-A-G
ATTA-A-T
versus
GGCA-G-G
GGCA-G-T
ATTA-G-T
AGTA-G-T
GGTA-G-T
Because the test is conducted in the context of a haplotypic test, it
has some slightly different properties to the standard association
test, which can sometimes be used to advantage. In particular, when
there is strong LD in the region, the haplotype information will often
help to fill in missing genotype data for single SNPs. Therefore,
rather than throwing away individuals with missing genotype data, it
is possible to try to reconstruct it from the surrounding region: this
can lessen the impact of non-random genotyping failure causing
spurious associations, as described below.
In this example, note that no other surrounding SNPs appear to show
strong association with disease, compared to the reference SNP: when
looking at the pattern of LD (RSQ column) we see that there are no
SNPs with very high LD (e.g. over 0.8) to the reference SNP, so this
is not necessarily surprising. Other haplotypes might be,
however: this is what the rest of the report considers. The second
section lists the haplotypes formed in that region given all flanking
proxy SNPs (including the reference SNP) and the frequency and
association with disease of each of these haplotypes.
Finally, the third part of the report contains that information on
single proxies SNPs or haplotypes of two of more proxies
(subhaplotypes) but excluding the reference SNP, that are in LD with
the reference SNP; this list is sorted by strength of association with
disease and filtered by other criteria, described below. For example,
the first line in the above example is:
HAP FREQ RSQ OR CHISQ P
..T. G 0.422 0.72 0.843 12.3 0.000449
This suggest although no single SNP shows a similar association to the
reference SNP in this region, a haplotype does show association
results of a similar magnitude and is correlated with the reference
SNP (in this case, the TG haplotype formed by the third and
last proxy SNPs).
So, in this particular example, this might be taken as additional
support for the association: it is of course still possible that the
association is just due to chance, or due to population
stratification, etc, but this would suggest that it is unlikely to be
due to some technical genotyping artefact that was specific to the
reference SNP, as we are also seeing the same signal from other SNPs
(or, as in this case, a haplotype formed from two other SNPs).
Naturally, if one considers enough proxy haplotypes, some are bound to
show stronger association with disease than the reference SNP merely
due to chance. One should therefore be careful in how these tests are
interpreted, i.e. not to forget the multiple testing that is implicit
here.
This kind of analysis represents the typical use case for proxy
association: we may have a single SNP association result, but the SNP
might be rare or have a higher genotyping failure rate than we would
like. Rather than exclude that SNP altogether, one option is to
include the SNP in analysis, assess evidence for assocation, and then
also ask whether other SNPs show the same signal. The assumption is
that although the true alleles at the proxy SNPs are (hopefully) not
independent of the reference SNP (i.e. there is LD) any technical
genotyping artefact that influenced the reference SNP is unlikely to
also be impacting the proxy SNPs (i.e. the implicit model of
genotyping failure is that most SNPs are okay, but a few SNPs might
fail: as such, we can use the surrounding genotype data to fill-in
failed genotypes, even if these SNPs failed in a very biased way,
e.g. if only TT homozygotes tended to fail and only in
cases).
Heuristic for selection of proxy SNPs
The main parameters for SNP selection are:
- LD thresholds between the index and proxies, and between the proxies themselves
- Maximum number of SNPs and kb range to search for proxies
- Maximum number of proxies to include
There are four main commands to influence the search strategy for
proxies:
--proxy-r2 A B C
--proxy-window # SNPs to search
--proxy-kb kb distance
--proxy-maxsnp # SNPs to include in final set
Proxies are chosen based on LD with the reference SNP as
follows. Proxies are examined one at a time in order of strongest to
weakest LD with the reference. A proxy must be above a certain minimum
r-squared threshold with the reference (criterion A),
although if we already have two proxies selected, a different
threshold is used (criterion B). In both cases, for a proxy
to be added, it must not have an r-squared greater than
criterion C with any proxy already selected. For common SNPs,
the default values for A, B and C are:
Is a proxy?
A) r-sq > 0.00 with reference if < 2 proxies selected
B) r-sq > 0.25 with reference if 2 or more proxies selected
C) r-sq < 0.50 with any other proxy
Setting A lower than B, and to 0 by default, ensures
that we always allow a chance of finding a 2-SNP haplotype that might
tag the reference SNP, even if no single SNP does.
By default, proxy association selects up to 5
(--proxy-maxsnp) SNPs flanking the reference SNP, from a
search of 15 SNPs (--proxy-window) either side of the
reference, at most 250 kb away (--proxy-kb).
The defaults vary depending on the frequency of the index SNP: for
rarer SNPs (MAF less than 0.1), a slightly larger search space will be
used. This threshold can be changed with the command
--proxy-b-threshold 0.05
In contrast to common SNPs, for which the defaults are:
--proxy-r2 0.00 0.25 0.50
--proxy-window 15
--proxy-kb 250
--proxy-maxsnp 5
these values for rarer SNPs (as defined by --proxy-b-threshold) and the commands that can be used to change them:
--proxy-b-r2 0.00 0.01 0.50
--proxy-b-window 30
--proxy-b-kb 500
--proxy-b-maxsnp 510
In other words, the search space is increased for rarer SNPs, to
increase the chance that a good haplotypic proxy is found even if
there is no other single SNP that well captures the variation at the
index site.
In addition, proxies must by default be above 0.01 MAF and below 0.05
genotyping failure rate. To explicitly select only more common proxies
with very high genotyping rate (e.g. to verify association at a
reference SNP with lower genotyping rate and a very rare allele), then
set values for
--proxy-maf
and
--proxy-geno
appropriately (these mirror the basic --maf
and --geno commands).
Finally, there are some parameters that determine the behavior of the
haplotypic proxy search (the 3rd section of the verbose
output). Haplotypes formed by proxies must have a frequency of at
least 0.01; these haplotypes must show an r-squared of at least 0.5
with the reference; when considering all possible subhaplotypes, only
permutations of up to 3 SNP-haplotypes are considered.
Overall, it is possible to change the behaviour of the basic proxy
selection heuristic with the following commands:
- to select a different number of flanking SNPs (--proxy-window)
- to filter proxy SNPs on distance to reference (--proxy-kb)
- to specify the maximum number of proxies (--proxy-maxsnp)
- to filter proxy SNPs on LD with reference (--proxy-r2)
- to not filter proxy SNPs on LD (--proxy-no-r2-filter, i.e. same as --proxy-r2 0 0 1)
- to filter proxy SNPs on MAF (--proxy-maf)
- to filter proxy SNPs on genotyping rate (--proxy-geno)
- to select a specifc set of flanking SNPs (--proxy-flanking)
- to filter haplotypes based on frequency (--proxy-mhf)
- to filter haplotypes based on LD with reference (--proxy-sub-r2)
- to select different levels of subhaplotype search (--proxy-sub-maxsnp)
For example, to select up to 6 SNPs, that are above 0.10 MAF and 0.01
genotyping failure rate, that are within 100 kb and 10 SNPs of the
reference SNP and that have an r-squared of at least 0.1 with the
refernce but no greater than 0.5 with an already-selected proxy SNP;
and then to look at all haplotype proxies that are above 0.005 minor
haplotype frequency and have an r-squared of at least 0.8 with the
reference SNP, use the command (line breaks added for clarity):
plink --file mydata --proxy-assoc rs6703905
--proxy-maxsnp 6
--proxy-r2 0.1 0.1 0.5
--proxy-window 10
--proxy-kb 100
--proxy-maf 0.1
--proxy-geno 0.01
--proxy-sub-r2 0.8
--proxy-mhf 0.005
As mentioned, rather than use the heuristic above, you can specify a
particular set of SNPs with the command
plink --file mydata --proxy-assoc rs6703905 --proxy-flanking my.proxy.list
where my.proxy.list is a file listing the SNPs you wish to use as proxies for
rs6703905, for example.
Warning There will possibly be a very, very large
number of possible combinations to consider if you make
both --proxy-maxsnp and --proxy-sub-maxsnp too
large, meaning that the analysis will take a very long time to
run. You should probably keep --proxy-maxsnp less than 10
and --proxy-sub-maxsnp less than 6.
HINT To speed up the proxy report, you need only
load in the relevant chromosomal region: that is, use
the --snp and --window options:
plink --bfile mydata --proxy-assoc rs12345 --snp rs12345 --window 300
Specifying the type of association test
By default, the --proxy-assoc command only applies to
population-based samples of unrelated individuals. It is suitable for
either disease (case/control) or quantitative trait outcomes: the
appropriate test will automatically be selected depending on the
phenotype.
The basic command cannot include covariates: however, if the
flag --proxy-glm is added, then the routines that correspond
to --linear and --logistic are used instead to test
the proxy association, meaning that covariates can be included (this
is slightly slower than the default analysis), e.g.
plink --bfile mydata --proxy-assoc rs12345 --proxy-glm --covar mycov.txt
BETA There is preliminary support for the TDT in this
context with the --proxy-tdt option; this has not yet been
fully tested however, and we do not yet suggest you use it generally.
Refining a single SNP association
The proxy association report is primarily designed simply to provide a
convenient way to automatically scan for evidence of the same
association signal coming from different sets of markers (that are
assumed to be independent in terms of technical artefact but not LD).
Of course, it is entirely possible that a 'proxy' may show a markedly
stronger association than the original reference SNP. In this way,
one might think of using the --proxy-assoc method as a way to
refine an association signal, or fine-map a region. In a whole genome
context, there is clearly nothing special about the particular SNP
genotyped that shows association: it may be representing just the tip
of an iceberg in association space, and certain haplotypes might have
a stronger association. One strategy and way of using haplotype
inflormation in a whole genome context, therefore, might be to scan
all single SNPs for modest levels of association, and then
exhaustively search the haplotype space surrounding those
SNPs, but constraining the search to only haplotypes that are in
LD with the original SNP (in this way, keeping the multiple
testing burden somewhat under control, as although many more tests are
added, they will all be quite highly correlated).
As implied in the section above, remember that taking just the best
proxy association result (i.e. the top listed in the 3rd section of
the report) will capitalize on chance and so these best values will
not follow asymptotic null test statistic distributions. These
p-values are perhaps best interpreted either against a set genome-wide
significance threshold, or corrected for the number of subhaplotypes
tested for a given reference SNP.
Automating for multiple references SNPs
To faciliate looking at more than one reference SNP at a time, you can
use the command
plink --bfile mydata --proxy-assoc all
or
plink --bfile mydata --proxy-assoc all --proxy-list hits.list
That is, instead of a SNP name after --proxy-assoc, put the
keyword all. PLINK will then treat as the reference, one at a
time, either all SNPs in the dataset (first usage) or in the subset
listed in the file hits.list (second usage).
By default, only a restricted degree of output is given, and no
"subhaplotype" tests are performed when more than one SNP is specified
as the reference (i.e. these correspond to the third section in the
above example output). To get the full report for every SNP (all
listed in a single file) add the option
--proxy-verbose
In non-verbose mode, the output is as follows, in a file
plink.assoc.proxy
with fields
CHR Chromosome code
SNP Reference SNP
BP Physical position
A1 Name of first allele
A2 Name of second allele
GENO Genotyping for the reference SNP
NPRX Number of proxy SNPs used to tag reference SNP
INFO Information metric for each reference SNP
F_A Reference SNP allele frequency in cases (disease traits)
F_U Reference SNP allele frequency in controls (disease traits)
OR Odds ratio (for disease traits)
P Asymptotic p-value for test of association
PROXIES (Optional, given --proxy-show-proxies) Displays actual proxy SNPs used
For, example, here are some lines from such an output file, in this case with the
--proxy-show-proxies
flag added, which appends the final PROXIES field to the output (lines truncated)
CHR SNP BP A1 A2 GENO NPRX INFO F_A F_U OR P PROXIES
17 rs731971 29529017 T A 0.00605 3 1.01 0.103 0.104 1.02 0.849 rs4794990|rs11652429|...
17 rs4794990 29529302 T C 0.00346 3 1.01 0.102 0.104 1.02 0.838 rs731971|rs11652429|...
17 rs12938546 29530562 C A 0.00115 3 0.996 0.0322 0.0381 1.19 0.186 rs7359592|rs887071|...
17 rs11652429 29531710 G C 0.00115 5 1 0.32 0.32 1 0.984 rs11080256|rs1024613|...
The p-values reported here take account of the fact that the SNP has
been probabilistically reconstructed. For example, the first line indicates that for
rs731971 three proxy SNPs were
selected, rs4794990, rs11652429
and rs887071.
The GENO and INFO have more meaning in the context
of imputation, as
described here, which involves running
proxy association/imputation with a reference panel, such as the
HapMap.
Providing some degree of robustness to non-random genotyping failure
When performing tests in a haplotype-context, the E-M algorithm is used to estimate
haplotpe frequencies and each individual's posterior haplotype phase probabilities.
The association test is then based on these fractional counts (i.e. allowing for ambiguity
in inferred haplotypes). As such, missing genotypes are quite naturally accommodated in this framework: if
for example an individual has genotypes for these 3 SNPs, then two haplotype phases are considered:
Observed Possible
genotypes --> haplotypes
A/C A/C G/G --> AAG / CCG
ACG / CAG
whereas if the third SNP has a missing genotype (and if the other allele
is T, for example) then
the standard approach is just to consider a larger, consistent set (which
are of course weighted by the current
estimate of the population haploytpe frequencies):
Observed Possible Possible
genotypes --> genotypes --> haplotypes
A/C A/C 0/0 --> A/C A/C G/G --> AAG / CCG
ACG / CAG
--> A/C A/C T/T --> AAT / CCT
ACT / CAT
--> A/C A/C G/T --> AAG / CCT
ACG / CAT
AAT / CCG
ACT / CAG
In this way, if there is strong LD between SNPs, we can
use the genotypes at flanking SNPs to effectively 'fill-in' missing
genotype data. One advantage of this is that, if the genotypes are not
missing at random for any given SNP, then it can
give a less biased test to fill in the true values using LD information,
rather than just to treat those genotypes as missing. This motivates a
reframing of the basic single SNP association statistic in terms of
groups of haplotypes rather than just as single SNPs (as shown above in
the first example). Consider this example, involving simulated data,
where the following haplotypes were simulated with these frequencies (in
both cases and controls, so we would not expect any association with
disease; 500 cases and 500 controls were generated).
Haplotype Population frequency
AABAB 0.4
AABBA 0.2
ABBBA 0.2
BBBBB 0.1
AAABB 0.1
We will label the five SNPs, snp1, snp2, etc. Some
non-random genotyping failure was simulated: in cases only, the
BB genotype of snp3 only has a genotyping rate of 0.5
(i.e. half were set to missing). Such as pattern of genotyping failure,
which is non-random with respect to both phenotype and genotype,
can tend to produce spurious association results. For example, here are
the basic single SNP results:
plink --file sim1 --assoc
which gives the output
CHR SNP A1 F_A F_U A2 CHISQ P OR
1 snp1 B 0.102 0.106 A 0.08585 0.7695 0.958
1 snp2 B 0.297 0.31 A 0.3997 0.5272 0.9403
1 snp3 A 0.1812 0.118 B 12.02 0.0005271 1.654
1 snp4 A 0.406 0.383 B 1.107 0.2927 1.101
1 snp5 A 0.388 0.393 B 0.05252 0.8187 0.9792
Note how snp3 shows a strong association (this is solely due to
the non-random drop-out of genotypes for this SNP). However, the proxy
association will, in this case, correct this:
plink --file sim1 --proxy-assoc snp3 --mind 1 --geno 1
Note that we use --mind and --geno to ensure that PLINK
does not discard any individuals, in this particular case (i.e. we will
use the flanking SNPs to fill in the missing data). This analysis gives the following
output
*** Proxy haplotype association report for snp3 ***
SNP MAF GENO KB RSQ OR CHISQ P
snp1 0.104 0 -0.002 0.0145 0.958 0.0859 0.77
snp2 0.303 0 -0.001 0.0544 0.94 0.4 0.527
snp3 0.141 0.213 0 * 0.868 0.993 0.319
snp4 0.394 0 0.001 0.0813 1.1 1.11 0.293
snp5 0.39 0 0.002 0.08 0.979 0.0525 0.819
..*.. FREQ OR CHISQ P
ABBBA 0.199 0.945 0.254 0.615
AABBA 0.191 1.03 0.0518 0.82
AABAB 0.394 1.1 1.11 0.293
AAABB 0.111 0.868 0.993 0.319
BBBBB 0.104 0.958 0.0859 0.77
Haplotype frequency estimation based on 2000 of 2000 founder chromosomes
Omnibus haplotype test statistic: 1.88, df = 4, p = 0.759
HAP FREQ RSQ OR CHISQ P
.A BB 0.111 1 0.868 0.993 0.319
A. BB 0.111 1 0.868 0.993 0.319
In otherwords, instead of removing individuals who are missing for
snp3 (which is implicitly what a single SNP association
statistic would do) we use the flanking data to fill in the unobserved
genotypes. Even if these are misssing not-at-random, if there is strong
LD then we will often be able to do a good job at guessing the true
genotype. Note that the other SNPs (that have no missing genotype data)
have identical association p-values under basic association test as under
this constrained haplotype test, as would be expected (i.e. under
most normal conditions, there is no loss of power in using a
proxy-association approach).
IMPORTANT It is very important to remember that this
test is not a panacea for the problem of missing data: many times there
will not be sufficient LD to accurately
reconstruct the missing genotype within the E-M. Future versions of PLINK
aim to add diagnostics to indicate when this is the case; also, one might
select the SNPs that define the flanking region more intelligently (e.g.
making use of known patterns of LD, etc).
As such, thie results of this test should most probably be interpreted as
follows: if a highly significant basic single SNP association result is
not significant by this method, one would worry about biased missingness
for that SNP; if a highly significant basic single SNP result
remains highly significant, this is only meaningful when there is strong
LD.
Of course, it is possible that other biases that are specific to
haplotype analysis (the ability to estimate rare haplotype frequencies,
etc) will impact these proxy tests, the effects of stratification may be
more pronounced, etc. As such, these tests should be interpreted only as
complementary pieces of information along with the basic SNP result,
rather than as water-tight proof of an unbiased association per
se.
However, if one knew up front that non-random genotyping drop-out
might be an issue (for example, cases and controls from from different
labs, different genotyping procedures used, etc) then it might seem
prudent to take this approach.
Note Normally individuals are removed from the
haplotype analysis if they are missing more than 50% of their genotypes
for a given haplotype: in this case, we try to not remove individuals,
but rather let the E-M fill in the missing data, so the rate is changed
to 0.9 by default; this can be altered with the --hap-miss
option.
|
|