PLINK: Whole genome data analysis toolset plink...
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

Introduction | Basics | Download | Reference | Formats | Data management | Summary stats | Filters | Stratification | IBS/IBD | Association | Family-based | Permutation | LD calcualtions | Haplotypes | Conditional tests | Proxy association | Imputation | Dosage data | Meta-analysis | Result annotation | Clumping | Gene Report | Epistasis | Rare CNVs | Common CNPs | R-plugins | SNP annotation | Simulation | Profiles | ID helper | Resources | Flow chart | Misc. | FAQ | gPLINK

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. Basic usage/data formats 6. Data management

7. Summary stats 8. Inclusion thresholds 9. Population stratification 10. IBS/IBD estimation 11. Association 12. Family-based association 13. Permutation procedures 14. LD calculations 15. Multimarker tests 16. Conditional haplotype tests 17. Proxy association 18. Imputation (beta) 19. Dosage data 20. Meta-analysis 21. Annotation 22. LD-based results clumping 23. Gene-based report 24. Epistasis 25. Rare CNVs 26. Common CNPs 27. R-plugins 28. Annotation web-lookup 29. Simulation tools 30. Profile scoring 31. ID helper 32. Resources 33. Flow-chart 34. Miscellaneous 35. FAQ & Hints

36. gPLINK
 

Proxy association

This page describes a convenience function designed to provide a quick representation of a single SNP association, in terms of the surrounding haplotypic background. Specifically, given a particular (reference) SNP this approach involves a) finding flanking markers and haplotypes (proxies) that are in strong linkage disequilibrium with the reference SNP and, b) testing these proxies for association with disease, within a haplotype-based framework.

There are three main applications of this utility, which are described in more detail and with examples in the main text below:
  • technical validation of single SNP results ( by looking for flanking haplotypes involving different markers that also show the same result )
  • refining a single SNP association signal ( is there a stronger association with a local haplotype? )
  • more robust single SNP tests ( by framing single SNP tests within a haplotypic framework, some degree of control against non-random genotyping failure can be achieved )
The proxy approach also forms the basis of the imputation methods in PLINK, described separately. The methods are identical in fact, the only difference in imputation mode is the presence of a reference set of individuals that is handled specially.

The proxy methods use the same basic EM algorithm used by the other haplotyping methods in PLINK. The only difference is that the proxy methods put a wrapper around the basic haplotyping procedure that a) provides some methods to automatically select proxies to phase given a designated reference SNP, and b) frames the subsequent tests and summaries in terms of groups of haplotypes that track the reference SNP.

Proxy association: basic usage

The basic proxy association method for a particular SNP is invoked with the --proxy-assoc option:

plink --file mydata --proxy-assoc rs6703905

which generates a file
     plink.proxy.report
This file contains three main sections, describing the local flanking SNPs, haplotypes and "proxies" for the reference SNP, and will be described below in turn. The full output file is shown here:


     *** Proxy haplotype association report for rs13232128 *** 

            SNP      MAF     GENO       KB      RSQ       OR    CHISQ        P
      rs1389273    0.286  0.00173    -99.2   0.0932    0.916     2.61    0.106
     rs10236783    0.253   0.0236    -66.9    0.214    0.875      5.7    0.017
     rs17556689    0.328  0.00259    -66.7    0.282      1.1     3.09    0.079
     rs17135491    0.153  0.00317    -2.59    0.153    0.934    0.955    0.328
     rs13232128    0.494   0.0179        0        *    0.828     14.9 0.000112
      rs1826529    0.487  0.00461     9.72    0.674    0.883     6.58   0.0103


        ....*.       FREQ         OR      CHISQ          P
        GTTGAG     0.0171       1.02     0.0221      0.882
        AGTGAG     0.0166      0.876      0.712      0.399
        GGTGAG      0.103       0.91       1.56      0.212
        GGCAAG     0.0226       0.97     0.0475      0.827
        ATTAAG      0.111      0.853       4.96      0.026
        GTTAAG     0.0615      0.877       2.09      0.149
        AGTAAG     0.0557      0.881       1.73      0.188
        GGTAAG     0.0513      0.949      0.306       0.58
        GGCAGG     0.0365       1.39       7.56    0.00596
        ATTAAT     0.0233      0.825       1.78      0.183
        GGCAGT      0.249       1.05      0.893      0.345
        ATTAGT     0.0201       1.31       3.26     0.0711
        AGTAGT      0.049       1.08      0.741      0.389
        GGTAGT       0.13       1.16       5.17      0.023

Haplotype frequency estimation based on 6938 of 6938 founder chromosomes
Omnibus haplotype test statistic: 23.3, df = 13, p = 0.0377

Of 125 subhaplotypes considered, 8 met proxy criteria

           HAP       FREQ        RSQ         OR      CHISQ          P
        ..T. G      0.422       0.72      0.843       12.3   0.000449
        .G.. T      0.453      0.705       1.15       8.62    0.00333
        .G.A T      0.445      0.693       1.14       7.22     0.0072
        GG.. T      0.399      0.561       1.14       7.15     0.0075
        .... G      0.487      0.674      0.883       6.58     0.0103
        ...A T      0.505      0.661       1.12       5.56     0.0184
        G... T      0.415      0.542       1.11       4.99     0.0255
        G..A T      0.408      0.535        1.1       4.19     0.0408
The first section lists the reference SNP (rs13232128) and 5 flanking SNPs that have been automatically selected as proxies. For each SNP, the minor allele frequency (MAF), genotyping failure rate (GENO) and distance to the reference SNP (KB) is given. A measure of single SNP association is also given for each SNP: odds ratio (OR), chi-squared statistic (CHISQ) and asymptotic p-value (P).

Importantly, however, these single SNP tests are not quite the same as from the basic --assoc command, as they are formed within the haplotypic context of the flanking SNPs. That is, for example, a single SNP test of the 5th SNP is formed by grouping the haplotypes as shown below, and testing for a difference in the frequency of the first group (containing A at the 5th position) versus the second group (all containing G).
     GTTG-A-G
     AGTG-A-G
     GGTG-A-G
     GGCA-A-G
     ATTA-A-G
     GTTA-A-G
     AGTA-A-G
     GGTA-A-G
     ATTA-A-T
    
      versus

     GGCA-G-G
     GGCA-G-T
     ATTA-G-T
     AGTA-G-T
     GGTA-G-T
Because the test is conducted in the context of a haplotypic test, it has some slightly different properties to the standard association test, which can sometimes be used to advantage. In particular, when there is strong LD in the region, the haplotype information will often help to fill in missing genotype data for single SNPs. Therefore, rather than throwing away individuals with missing genotype data, it is possible to try to reconstruct it from the surrounding region: this can lessen the impact of non-random genotyping failure causing spurious associations, as described below.

In this example, note that no other surrounding SNPs appear to show strong association with disease, compared to the reference SNP: when looking at the pattern of LD (RSQ column) we see that there are no SNPs with very high LD (e.g. over 0.8) to the reference SNP, so this is not necessarily surprising. Other haplotypes might be, however: this is what the rest of the report considers. The second section lists the haplotypes formed in that region given all flanking proxy SNPs (including the reference SNP) and the frequency and association with disease of each of these haplotypes.

Finally, the third part of the report contains that information on single proxies SNPs or haplotypes of two of more proxies (subhaplotypes) but excluding the reference SNP, that are in LD with the reference SNP; this list is sorted by strength of association with disease and filtered by other criteria, described below. For example, the first line in the above example is:
 
        HAP       FREQ        RSQ         OR      CHISQ          P
     ..T. G      0.422       0.72      0.843       12.3   0.000449
This suggest although no single SNP shows a similar association to the reference SNP in this region, a haplotype does show association results of a similar magnitude and is correlated with the reference SNP (in this case, the TG haplotype formed by the third and last proxy SNPs).

So, in this particular example, this might be taken as additional support for the association: it is of course still possible that the association is just due to chance, or due to population stratification, etc, but this would suggest that it is unlikely to be due to some technical genotyping artefact that was specific to the reference SNP, as we are also seeing the same signal from other SNPs (or, as in this case, a haplotype formed from two other SNPs).

Naturally, if one considers enough proxy haplotypes, some are bound to show stronger association with disease than the reference SNP merely due to chance. One should therefore be careful in how these tests are interpreted, i.e. not to forget the multiple testing that is implicit here.

This kind of analysis represents the typical use case for proxy association: we may have a single SNP association result, but the SNP might be rare or have a higher genotyping failure rate than we would like. Rather than exclude that SNP altogether, one option is to include the SNP in analysis, assess evidence for assocation, and then also ask whether other SNPs show the same signal. The assumption is that although the true alleles at the proxy SNPs are (hopefully) not independent of the reference SNP (i.e. there is LD) any technical genotyping artefact that influenced the reference SNP is unlikely to also be impacting the proxy SNPs (i.e. the implicit model of genotyping failure is that most SNPs are okay, but a few SNPs might fail: as such, we can use the surrounding genotype data to fill-in failed genotypes, even if these SNPs failed in a very biased way, e.g. if only TT homozygotes tended to fail and only in cases).
Heuristic for selection of proxy SNPs
The main parameters for SNP selection are:
  • LD thresholds between the index and proxies, and between the proxies themselves
  • Maximum number of SNPs and kb range to search for proxies
  • Maximum number of proxies to include
There are four main commands to influence the search strategy for proxies:
     --proxy-r2       A  B  C 
     --proxy-window   # SNPs to search
     --proxy-kb       kb distance
     --proxy-maxsnp   # SNPs to include in final set
Proxies are chosen based on LD with the reference SNP as follows. Proxies are examined one at a time in order of strongest to weakest LD with the reference. A proxy must be above a certain minimum r-squared threshold with the reference (criterion A), although if we already have two proxies selected, a different threshold is used (criterion B). In both cases, for a proxy to be added, it must not have an r-squared greater than criterion C with any proxy already selected. For common SNPs, the default values for A, B and C are:
    Is a proxy? 
          A) r-sq > 0.00 with reference          if < 2 proxies selected
          B) r-sq > 0.25 with reference          if 2 or more proxies selected
          C) r-sq < 0.50 with any other proxy
Setting A lower than B, and to 0 by default, ensures that we always allow a chance of finding a 2-SNP haplotype that might tag the reference SNP, even if no single SNP does.

By default, proxy association selects up to 5 (--proxy-maxsnp) SNPs flanking the reference SNP, from a search of 15 SNPs (--proxy-window) either side of the reference, at most 250 kb away (--proxy-kb).

The defaults vary depending on the frequency of the index SNP: for rarer SNPs (MAF less than 0.1), a slightly larger search space will be used. This threshold can be changed with the command
     --proxy-b-threshold 0.05
In contrast to common SNPs, for which the defaults are:
     --proxy-r2       0.00  0.25  0.50
     --proxy-window   15
     --proxy-kb       250
     --proxy-maxsnp   5 
these values for rarer SNPs (as defined by --proxy-b-threshold) and the commands that can be used to change them:
     --proxy-b-r2       0.00  0.01  0.50
     --proxy-b-window   30
     --proxy-b-kb       500
     --proxy-b-maxsnp   510
In other words, the search space is increased for rarer SNPs, to increase the chance that a good haplotypic proxy is found even if there is no other single SNP that well captures the variation at the index site.

In addition, proxies must by default be above 0.01 MAF and below 0.05 genotyping failure rate. To explicitly select only more common proxies with very high genotyping rate (e.g. to verify association at a reference SNP with lower genotyping rate and a very rare allele), then set values for
     --proxy-maf
and
     --proxy-geno
appropriately (these mirror the basic --maf and --geno commands).

Finally, there are some parameters that determine the behavior of the haplotypic proxy search (the 3rd section of the verbose output). Haplotypes formed by proxies must have a frequency of at least 0.01; these haplotypes must show an r-squared of at least 0.5 with the reference; when considering all possible subhaplotypes, only permutations of up to 3 SNP-haplotypes are considered.

Overall, it is possible to change the behaviour of the basic proxy selection heuristic with the following commands:
  • to select a different number of flanking SNPs (--proxy-window)
  • to filter proxy SNPs on distance to reference (--proxy-kb)
  • to specify the maximum number of proxies (--proxy-maxsnp)
  • to filter proxy SNPs on LD with reference (--proxy-r2)
  • to not filter proxy SNPs on LD (--proxy-no-r2-filter, i.e. same as --proxy-r2 0 0 1)
  • to filter proxy SNPs on MAF (--proxy-maf)
  • to filter proxy SNPs on genotyping rate (--proxy-geno)
  • to select a specifc set of flanking SNPs (--proxy-flanking)

  • to filter haplotypes based on frequency (--proxy-mhf)
  • to filter haplotypes based on LD with reference (--proxy-sub-r2)
  • to select different levels of subhaplotype search (--proxy-sub-maxsnp)
For example, to select up to 6 SNPs, that are above 0.10 MAF and 0.01 genotyping failure rate, that are within 100 kb and 10 SNPs of the reference SNP and that have an r-squared of at least 0.1 with the refernce but no greater than 0.5 with an already-selected proxy SNP; and then to look at all haplotype proxies that are above 0.005 minor haplotype frequency and have an r-squared of at least 0.8 with the reference SNP, use the command (line breaks added for clarity):
 plink --file mydata --proxy-assoc rs6703905
                     --proxy-maxsnp 6
                     --proxy-r2 0.1 0.1 0.5
                     --proxy-window 10
                     --proxy-kb 100
                     --proxy-maf 0.1 
                     --proxy-geno 0.01 
                     --proxy-sub-r2 0.8 
                     --proxy-mhf 0.005 

As mentioned, rather than use the heuristic above, you can specify a particular set of SNPs with the command
plink --file mydata --proxy-assoc rs6703905 --proxy-flanking my.proxy.list

where my.proxy.list is a file listing the SNPs you wish to use as proxies for rs6703905, for example.

Warning There will possibly be a very, very large number of possible combinations to consider if you make both --proxy-maxsnp and --proxy-sub-maxsnp too large, meaning that the analysis will take a very long time to run. You should probably keep --proxy-maxsnp less than 10 and --proxy-sub-maxsnp less than 6.

HINT To speed up the proxy report, you need only load in the relevant chromosomal region: that is, use the --snp and --window options:
plink --bfile mydata --proxy-assoc rs12345 --snp rs12345 --window 300

Specifying the type of association test
By default, the --proxy-assoc command only applies to population-based samples of unrelated individuals. It is suitable for either disease (case/control) or quantitative trait outcomes: the appropriate test will automatically be selected depending on the phenotype.

The basic command cannot include covariates: however, if the flag --proxy-glm is added, then the routines that correspond to --linear and --logistic are used instead to test the proxy association, meaning that covariates can be included (this is slightly slower than the default analysis), e.g.
plink --bfile mydata --proxy-assoc rs12345 --proxy-glm --covar mycov.txt

BETA There is preliminary support for the TDT in this context with the --proxy-tdt option; this has not yet been fully tested however, and we do not yet suggest you use it generally.

Refining a single SNP association

The proxy association report is primarily designed simply to provide a convenient way to automatically scan for evidence of the same association signal coming from different sets of markers (that are assumed to be independent in terms of technical artefact but not LD). Of course, it is entirely possible that a 'proxy' may show a markedly stronger association than the original reference SNP. In this way, one might think of using the --proxy-assoc method as a way to refine an association signal, or fine-map a region. In a whole genome context, there is clearly nothing special about the particular SNP genotyped that shows association: it may be representing just the tip of an iceberg in association space, and certain haplotypes might have a stronger association. One strategy and way of using haplotype inflormation in a whole genome context, therefore, might be to scan all single SNPs for modest levels of association, and then exhaustively search the haplotype space surrounding those SNPs, but constraining the search to only haplotypes that are in LD with the original SNP (in this way, keeping the multiple testing burden somewhat under control, as although many more tests are added, they will all be quite highly correlated).

As implied in the section above, remember that taking just the best proxy association result (i.e. the top listed in the 3rd section of the report) will capitalize on chance and so these best values will not follow asymptotic null test statistic distributions. These p-values are perhaps best interpreted either against a set genome-wide significance threshold, or corrected for the number of subhaplotypes tested for a given reference SNP.

Automating for multiple references SNPs

To faciliate looking at more than one reference SNP at a time, you can use the command
plink --bfile mydata --proxy-assoc all

or
plink --bfile mydata --proxy-assoc all --proxy-list hits.list

That is, instead of a SNP name after --proxy-assoc, put the keyword all. PLINK will then treat as the reference, one at a time, either all SNPs in the dataset (first usage) or in the subset listed in the file hits.list (second usage).

By default, only a restricted degree of output is given, and no "subhaplotype" tests are performed when more than one SNP is specified as the reference (i.e. these correspond to the third section in the above example output). To get the full report for every SNP (all listed in a single file) add the option
     --proxy-verbose
In non-verbose mode, the output is as follows, in a file
     plink.assoc.proxy
with fields
     CHR        Chromosome code
     SNP        Reference SNP
     BP         Physical position 
     A1         Name of first allele
     A2         Name of second allele
     GENO       Genotyping for the reference SNP
     NPRX       Number of proxy SNPs used to tag reference SNP
     INFO       Information metric for each reference SNP
     F_A        Reference SNP allele frequency in cases (disease traits)
     F_U        Reference SNP allele frequency in controls (disease traits)
     OR         Odds ratio (for disease traits)
     P          Asymptotic p-value for test of association
     PROXIES    (Optional, given --proxy-show-proxies)  Displays actual proxy SNPs used
For, example, here are some lines from such an output file, in this case with the
     --proxy-show-proxies
flag added, which appends the final PROXIES field to the output (lines truncated)

CHR        SNP       BP A1 A2    GENO NPRX  INFO    F_A    F_U   OR     P PROXIES
 17   rs731971 29529017  T  A 0.00605    3  1.01  0.103  0.104 1.02 0.849 rs4794990|rs11652429|...
 17  rs4794990 29529302  T  C 0.00346    3  1.01  0.102  0.104 1.02 0.838 rs731971|rs11652429|...
 17 rs12938546 29530562  C  A 0.00115    3 0.996 0.0322 0.0381 1.19 0.186 rs7359592|rs887071|...
 17 rs11652429 29531710  G  C 0.00115    5     1   0.32   0.32    1 0.984 rs11080256|rs1024613|...
The p-values reported here take account of the fact that the SNP has been probabilistically reconstructed. For example, the first line indicates that for rs731971 three proxy SNPs were selected, rs4794990, rs11652429 and rs887071.

The GENO and INFO have more meaning in the context of imputation, as described here, which involves running proxy association/imputation with a reference panel, such as the HapMap.

Providing some degree of robustness to non-random genotyping failure

When performing tests in a haplotype-context, the E-M algorithm is used to estimate haplotpe frequencies and each individual's posterior haplotype phase probabilities. The association test is then based on these fractional counts (i.e. allowing for ambiguity in inferred haplotypes). As such, missing genotypes are quite naturally accommodated in this framework: if for example an individual has genotypes for these 3 SNPs, then two haplotype phases are considered:
  Observed           Possible
  genotypes    -->   haplotypes            

  A/C A/C G/G  -->   AAG / CCG
                     ACG / CAG
whereas if the third SNP has a missing genotype (and if the other allele is T, for example) then the standard approach is just to consider a larger, consistent set (which are of course weighted by the current estimate of the population haploytpe frequencies):
  Observed           Possible            Possible 
  genotypes    -->   genotypes     -->   haplotypes            

  A/C A/C 0/0  -->   A/C A/C G/G   -->   AAG / CCG 
                                         ACG / CAG

               -->   A/C A/C T/T   -->   AAT / CCT
                                         ACT / CAT

               -->   A/C A/C G/T   -->   AAG / CCT
                                         ACG / CAT
                                         AAT / CCG
                                         ACT / CAG
In this way, if there is strong LD between SNPs, we can use the genotypes at flanking SNPs to effectively 'fill-in' missing genotype data. One advantage of this is that, if the genotypes are not missing at random for any given SNP, then it can give a less biased test to fill in the true values using LD information, rather than just to treat those genotypes as missing. This motivates a reframing of the basic single SNP association statistic in terms of groups of haplotypes rather than just as single SNPs (as shown above in the first example). Consider this example, involving simulated data, where the following haplotypes were simulated with these frequencies (in both cases and controls, so we would not expect any association with disease; 500 cases and 500 controls were generated).
     Haplotype    Population frequency
     AABAB        0.4
     AABBA        0.2
     ABBBA        0.2
     BBBBB        0.1
     AAABB        0.1
We will label the five SNPs, snp1, snp2, etc. Some non-random genotyping failure was simulated: in cases only, the BB genotype of snp3 only has a genotyping rate of 0.5 (i.e. half were set to missing). Such as pattern of genotyping failure, which is non-random with respect to both phenotype and genotype, can tend to produce spurious association results. For example, here are the basic single SNP results:
plink --file sim1 --assoc

which gives the output
     CHR  SNP   A1      F_A      F_U   A2        CHISQ            P           OR
       1 snp1    B    0.102    0.106    A      0.08585       0.7695        0.958
       1 snp2    B    0.297     0.31    A       0.3997       0.5272       0.9403
       1 snp3    A   0.1812    0.118    B        12.02    0.0005271        1.654  
       1 snp4    A    0.406    0.383    B        1.107       0.2927        1.101
       1 snp5    A    0.388    0.393    B      0.05252       0.8187       0.9792
Note how snp3 shows a strong association (this is solely due to the non-random drop-out of genotypes for this SNP). However, the proxy association will, in this case, correct this:
plink --file sim1 --proxy-assoc snp3 --mind 1 --geno 1

Note that we use --mind and --geno to ensure that PLINK does not discard any individuals, in this particular case (i.e. we will use the flanking SNPs to fill in the missing data). This analysis gives the following output


     *** Proxy haplotype association report for snp3 ***

 SNP      MAF     GENO       KB      RSQ       OR    CHISQ        P
snp1    0.104        0   -0.002   0.0145    0.958   0.0859     0.77
snp2    0.303        0   -0.001   0.0544     0.94      0.4    0.527
snp3    0.141    0.213        0        *    0.868    0.993    0.319
snp4    0.394        0    0.001   0.0813      1.1     1.11    0.293
snp5     0.39        0    0.002     0.08    0.979   0.0525    0.819


         ..*..       FREQ         OR      CHISQ          P
         ABBBA      0.199      0.945      0.254      0.615
         AABBA      0.191       1.03     0.0518       0.82
         AABAB      0.394        1.1       1.11      0.293
         AAABB      0.111      0.868      0.993      0.319
         BBBBB      0.104      0.958     0.0859       0.77

Haplotype frequency estimation based on 2000 of 2000 founder chromosomes
Omnibus haplotype test statistic: 1.88, df = 4, p = 0.759

           HAP       FREQ        RSQ         OR      CHISQ          P
         .A BB      0.111          1      0.868      0.993      0.319
         A. BB      0.111          1      0.868      0.993      0.319
In otherwords, instead of removing individuals who are missing for snp3 (which is implicitly what a single SNP association statistic would do) we use the flanking data to fill in the unobserved genotypes. Even if these are misssing not-at-random, if there is strong LD then we will often be able to do a good job at guessing the true genotype. Note that the other SNPs (that have no missing genotype data) have identical association p-values under basic association test as under this constrained haplotype test, as would be expected (i.e. under most normal conditions, there is no loss of power in using a proxy-association approach).

IMPORTANT It is very important to remember that this test is not a panacea for the problem of missing data: many times there will not be sufficient LD to accurately reconstruct the missing genotype within the E-M. Future versions of PLINK aim to add diagnostics to indicate when this is the case; also, one might select the SNPs that define the flanking region more intelligently (e.g. making use of known patterns of LD, etc).

As such, thie results of this test should most probably be interpreted as follows: if a highly significant basic single SNP association result is not significant by this method, one would worry about biased missingness for that SNP; if a highly significant basic single SNP result remains highly significant, this is only meaningful when there is strong LD.

Of course, it is possible that other biases that are specific to haplotype analysis (the ability to estimate rare haplotype frequencies, etc) will impact these proxy tests, the effects of stratification may be more pronounced, etc. As such, these tests should be interpreted only as complementary pieces of information along with the basic SNP result, rather than as water-tight proof of an unbiased association per se.

However, if one knew up front that non-random genotyping drop-out might be an issue (for example, cases and controls from from different labs, different genotyping procedures used, etc) then it might seem prudent to take this approach.

Note Normally individuals are removed from the haplotype analysis if they are missing more than 50% of their genotypes for a given haplotype: in this case, we try to not remove individuals, but rather let the E-M fill in the missing data, so the rate is changed to 0.9 by default; this can be altered with the --hap-miss option.
 
This document last modified Wednesday, 25-Jan-2017 11:52:28 EST