PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

Epistasis testing

This page contains extra details on the test for epistasis implemented in the --fast-epistasis command, designed for the detection of SNPxSNP pairwise interactions in large-scale case-control association studies.

This test is based on a Z-score for the difference in SNP-SNP association (odds ratio) between cases and controls (or in cases only, in a case-only analysis).

We follow the procedure for constructing an allelic test of a single locus, twice collapsing three genotype categories into two allele categories. Specifically, we count the 4N independent alleles observed at two loci in a sample of N individuals into a 2x2 table, following the logic below, so the allele (not the individual or haplotype) is the unit of analysis.
         BB  Bb  bb
     AA  a   b   c
     Aa  d   e   f
     aa  g   h   i
We first count alleles at one locus, e.g. B, conditional on the genotype at A, which can be represented as a 3x2 table:
         B     b
     AA  2a+b  2c+b
     Aa  2d+e  2f+e
     aa  2g+h  2i+h
which represents 2N alleles, not N individuals. We then collapse this 3x2 table into a 2x2 table, counting each AA or aa row twice and each Aa row once for both A and a, as follows:
        B            b
     A  4a+2b+2d+e   4c+2b+2f+e
     a  4g+2h+2d+e   4i+2h+2f+e
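The two collapsing steps above can be sketched in a few lines of Python. This is only an illustration of the counting scheme, not PLINK's own code; the function name collapse_to_allele_table and the nested-list layout of the genotype table are assumptions made for this example.

```python
def collapse_to_allele_table(g):
    """Collapse a 3x3 genotype-count table into a 2x2 allele-count table.

    g[i] is the row for locus-A genotype AA, Aa, aa (i = 0, 1, 2) and holds
    the counts of individuals with locus-B genotype BB, Bb, bb.
    (Hypothetical helper for illustration; not part of PLINK.)
    """
    # Step 1: within each locus-A genotype, count B and b alleles (3x2 table).
    by_a_genotype = [(2 * n_BB + n_Bb, 2 * n_bb + n_Bb) for n_BB, n_Bb, n_bb in g]
    (AA_B, AA_b), (Aa_B, Aa_b), (aa_B, aa_b) = by_a_genotype

    # Step 2: weight each row by the number of A (or a) alleles it carries.
    return [
        [2 * AA_B + Aa_B, 2 * AA_b + Aa_b],  # row for allele A
        [2 * aa_B + Aa_B, 2 * aa_b + Aa_b],  # row for allele a
    ]


# Genotype counts a..i = 1..9, i.e. N = 45 individuals.
genotypes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
table = collapse_to_allele_table(genotypes)
print(table)                         # [[21, 33], [57, 69]]
print(sum(sum(r) for r in table))    # 180, i.e. 4N
```

The final line checks that the four cells add up to 4N, which is the sense in which the allele, rather than the individual, is the unit of analysis.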
Based on this 2x2 table, the odds ratio between loci A and B and its standard error are calculated in the standard manner. When cases and controls are present, the above procedure is performed separately in cases and controls, and the test for epistasis is based on the difference between the two log odds ratios:
   Z = ( log(R) - log(S) ) / sqrt( Var(log(R)) + Var(log(S)) )
where R and S are the odds ratios in cases and controls respectively. Writing a, b, c, d for the four cells of the 2x2 table above (top-left, top-right, bottom-left, bottom-right; these are not the genotype counts of the 3x3 table), each odds ratio is estimated as ad/bc and the variance of its logarithm as 1/a+1/b+1/c+1/d. Under the null hypothesis of no interaction on the multiplicative scale, Z follows a standard normal distribution.
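As a minimal sketch of this calculation, the Python below computes Z from one 2x2 allele table for cases and one for controls. It uses the standard cross-product odds ratio ad/bc for a table with rows (a, b) and (c, d), and 1/a+1/b+1/c+1/d as the variance of the log odds ratio. The function names are hypothetical, and details such as the handling of empty cells are assumptions rather than a description of what PLINK does.

```python
import math


def log_odds_ratio(table):
    """Return (log OR, variance of log OR) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    if min(a, b, c, d) <= 0:
        # Assumption for this sketch: refuse tables with empty cells.
        raise ValueError("all four cells must be positive")
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def epistasis_z(case_table, control_table):
    """Z-score for the difference in log odds ratio between cases and controls."""
    log_r, var_r = log_odds_ratio(case_table)
    log_s, var_s = log_odds_ratio(control_table)
    return (log_r - log_s) / math.sqrt(var_r + var_s)


def two_sided_p(z):
    """Two-sided P-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up allele tables, for illustration only.
z = epistasis_z([[40, 20], [10, 30]], [[20, 20], [20, 20]])
print(round(z, 2), two_sided_p(z))
```

Identical case and control tables give Z = 0 and a P-value of 1, as expected when the SNP-SNP association does not differ between the two groups.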

Note that, despite its superficial similarity to a table of 2N haplotypes (AB, Ab, aB and ab), this table ignores phase, i.e. we are not attempting to resolve phase for Aa/Bb individuals. Rather, given 4N independent alleles (assuming Hardy-Weinberg and linkage equilibrium for the two test loci), these 4N observations are simply counted following the scheme given above, which partitions the 4N counts into a 2x2 table. Although this is an inexact heuristic, we observe appropriate type I error rates in simulation (see below) and power equivalent to that of the logistic regression test. The correlation with a logistic regression analysis is very high (r = 0.995, based on -log10 P-values).

The table below shows the type I error of the case-control epistasis test. We considered three models, none of which includes an interaction between the two unlinked SNPs: no marginal SNP effect (model 1), a strong marginal effect at one SNP (model 2), or a strong marginal effect at both SNPs (model 3). Type I error is based on the analysis of 100,000 simulated datasets (disease prevalence = 0.01, minor allele frequency = 0.1 for both loci).
   Model   Marginal SNP effects        Type I error at nominal alpha
              (odds ratio)
           SNP A       SNP B           0.05      0.0005

   1       1.0         1.0             0.04750   0.00030
   2       1.0         1.4             0.04941   0.00030
   3       1.4         1.4             0.04817   0.00048
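For readers who want to explore the behaviour of the test themselves, the sketch below simulates the setting of model 1 only (no marginal effects and no interaction, so cases and controls are drawn from the same genotype distribution). It is not the simulation used to produce the table: the group size, the number of replicates, the seed and all function names are assumptions chosen to keep the example fast, so the estimated rate will be much noisier than the tabulated values.

```python
import math
import random


def allele_table(genotype_pairs):
    """2x2 allele table from (gA, gB) pairs coded as minor-allele counts 0/1/2."""
    t = [[0, 0], [0, 0]]
    for ga, gb in genotype_pairs:
        # Each individual contributes 4 counts: every allele at locus A is
        # paired with every allele at locus B (phase is ignored).
        for i, na in enumerate((2 - ga, ga)):
            for j, nb in enumerate((2 - gb, gb)):
                t[i][j] += na * nb
    return t


def z_score(cases, controls):
    """Difference in log odds ratio between cases and controls, as a Z-score."""
    def log_or(t):
        (a, b), (c, d) = t
        return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

    log_r, var_r = log_or(allele_table(cases))
    log_s, var_s = log_or(allele_table(controls))
    return (log_r - log_s) / math.sqrt(var_r + var_s)


def simulate_null(n_reps=200, n_per_group=500, maf=0.1, z_crit=1.96, seed=1):
    """Fraction of null replicates with |Z| > z_crit (illustrative settings)."""
    rng = random.Random(seed)

    def genotype():
        # Hardy-Weinberg proportions: two independent allele draws.
        return (rng.random() < maf) + (rng.random() < maf)

    def sample():
        # Unlinked loci: genotypes at A and B are drawn independently.
        return [(genotype(), genotype()) for _ in range(n_per_group)]

    hits = sum(abs(z_score(sample(), sample())) > z_crit for _ in range(n_reps))
    return hits / n_reps


print(simulate_null())  # estimated type I error at nominal alpha = 0.05
```

Models 2 and 3 would additionally require sampling cases and controls conditional on disease status under the stated marginal odds ratios and prevalence, which this sketch does not attempt.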
The power to detect a large interaction effect (GRR = 2) with no marginal single-SNP effects was 0.74 (alpha = 1.2e-12; disease prevalence = 0.01, MAF = 0.1 for both loci). Power for other two-locus models can be estimated using the Genetic Power Calculator (GPC).

Our procedure assumes that Hardy-Weinberg and linkage equilibrium hold for the two SNPs in the population. However, simulation studies have shown the case/control test to be very robust to deviations from the linkage equilibrium assumption, whereas the case-only test is not (data not shown). Analogous to adopting an allelic single-locus test, we also assume an allelic mode of gene action, in which any interaction term represents an allele-by-allele effect, not a genotype-by-genotype effect.

HINT If you use this test to screen a large number of SNPs, you should probably also report the result of the more standard logistic regression test. In practice, the two approaches usually give similar results, which justifies the use of --fast-epistasis as a screening tool for a computationally demanding problem. Of course, for a specific (and often extreme) --epi1 threshold, the exact above-threshold list of SNPs will not always be the same; if you choose to use this approach, it is probably wise to use a reasonably liberal --epi1 threshold to select a subset of pairs of SNPs, and then test that subset with the more standard --epistasis command.
This document last modified Wednesday, 25-Jan-2017 11:51:34 EST