PSEQ association analysis

A number of approaches to testing phenotype-genotype associations are available in PLINK/Seq, using permutation or asymptotic statistics, with either single variants, genes or gene-sets as the unit of analysis. On this page we describe:

In general, we expect that methods for relating genotype to phenotype will be an area of future focus and expansion in PLINK/Seq, beyond the few methods described on this page.

Single variant association

To test each variant for association with a disease (dichotomous) outcome, use the command:

pseq proj v-assoc --phenotype my.phenotype

This assumes that my.phenotype is a dichotomous phenotype that exists in the INDDB. As for most commands, v-assoc can be combined with a mask.

In single VCF mode, rather than specifying a project file (proj) the VCF is directly specified. In this case, the --phenotype argument takes two parameters: the filename followed by the phenotype label:

pseq data.vcf.gz v-assoc --phenotype phenofile.txt my.phenotype

By default, the v-assoc command (expecting a dichotomous phenotype to be specified) uses Fisher's exact test to perform allelic, dominant and recessive single variant tests. To instead calculate empirical significance values based on a fixed number of permutations, add:

--perm 10000

To use adpative permutations:

--perm -1

The output is a tab-delimited file with one row per variant:

VAR REF SAMPLES FILTER VMETA CONMETA ALT MAF HWE MINA MINU OBSA OBSU REFA HETA HOMA REFU HETU HOMU P OR PDOM ORDOM PREC ORREC chr1:1105366:rs111751804 T CEU TSI PASS PASS AA=T;BN=132;GP=1:1115503 . C 0.0286885 1 4 3 57 65 53 4 0 62 3 0 0.708391 1.53939 0.704253 1.55975 1 1.13913 chr1:1105411:rs114390380 G CEU TSI PASS PASS AA=G;BN=132;GP=1:1115548 . A 0.0127119 1 1 2 53 65 52 1 0 63 2 0 1 0.609524 1 0.605769 1 1.2243 chr1:1108138:rs61733845 C CEU TSI PASS PASS AA=c;BN=129;GP=1:1118275 . T 0.036 1 3 6 61 64 58 3 0 58 6 0 0.50094 0.512605 0.492564 0.5 1 1.04878 chr1:1110240:rs116321663 T CEU TSI PASS PASS AA=T;BN=132;GP=1:1120377 . A 0.00645161 1 1 1 89 66 88 1 0 65 1 0 1 0.740113 1 0.738636 1 0.743017 chr1:1110294:rs1320571 G CEU TSI PASS PASS AA=A;BN=88;GP=1:1120431;HM2;HM3 . A 0.0387324 1 6 5 77 65 71 6 0 60 5 0 1 1.01351 1 1.01408 1 0.845161 chr1:3537996:rs2760321 T CEU TSI PASS PASS AA=C;BN=100;GP=1:3548136;HM2;HM3 . C 0.845588 1 18 24 73 63 1 16 56 2 20 41 0.133643 1.6732 0.596225 2.36066 0.1829 1.76758 chr1:3538692:rs2760320 G CEU TSI PASS PASS AA=G;BN=100;GP=1:3548832;HM2;HM3 . C 0.0580645 1 13 5 89 66 76 13 0 61 5 0 0.226071 2.00121 0.211617 2.08684 1 0.743017 chr1:3538715:rs114376964 T TSI PASS AA=T;BN=132;GP=1:3548855 . C 0.00757576 1 0 1 0 66 0 0 0 65 1 0 1 87.6667 1 43.6667 1 133 chr1:3541597:rs116230480 C CEU PASS AA=C;BN=132;GP=1:3551737 . T 0.00561798 1 177 0 89 0 88 1 0 0 0 0 1 0.008450 1 0.0169492 1 0.00558659

   VAR      Variant identifier
   REF      Reference allele
   SAMPLES  Sample tags for which site is called, space-delimited
   FILTER   FILTERs for those sites, space-delimited
   VMETA    Variant meta-information
   CONMETA  Consensus sample-variant meta-information

   ALT      Alternate allele(s), comma-delimited
   MAF      Minor allele frequency
   HWE      P-value from Hardy-Weinberg disequilibrium test (exact test)
 
   MINA     Number of minor alleles in cases
   MINU     Number of minor alleles in controls
   OBSA     Number of non-null genotypes in cases
   OBSU     Number of non-null genotypes in controls

   REFA     Number of reference homozygotes in cases
   HETA     Number of reference/alternate heterozygotes in cases
   HOMA     Number of alternate/alternate homozygsotes in cases

   REFU     As above, in controls
   HETU
   HOMU

   P        p-value for single site association (allelic)
   OR       Allelic odds ratio

   PDOM     p-value, dominant model for alternate allele
   ORDOM    Odds ratio under dominant model

   PREC     Empirical p-valie, recessive model for alternate allele
   ORREC    Odds ratio under recessive model

If any of the four cells are zero, the Woolf-Haldane correction is applied (adding 0.5 to all cells) in order to calculate an odds ratio.

Linear and logistic regression tests

To run a linear or logistic regression on each single variant, use the glm command:

pseq proj glm --phenotype phe1 --covar mds1 mds2

The type of test will depend on the phenotype (quantitative trait or dichotomous disease trait). Currently only asymptotic p-values are given. Naturally, for rare alleles these analyses will be under-powered and/or unreliable: a mask options such as maf=0.02 should probably be routinely applied.

VAR REF ALT N F F_A F_U OR SE STAT P chr1:3537996:rs2760321 T C 136 0.845588 0.876712 0.809524 1.69261 0.3453 1.52369 0.127586 chr1:3538692:rs2760320 G C 155 0.0580645 0.0730337 0.0378788 2.08684 0.5536 1.32885 0.183898 chr1:6447088:rs11800462 T C 150 0.0733333 0.0697674 0.078125 0.87567 0.4640 -0.286074 0.774822 chr1:11633148:rs9614 T G 99 0.146465 0.117021 0.173077 0.68816 0.3757 -0.994677 0.319893 chr1:17786644:rs34417109 G A 151 0.0662252 0.0755814 0.0538462 1.47554 0.5007 0.776927 0.437202 chr1:17786709:rs35497285 G A 148 0.0777027 0.0823529 0.0714286 1.1831 0.4638 0.362493 0.716984 chr1:17833932:rs34725104 C T 130 0.142308 0.110294 0.177419 0.57904 0.3616 -1.51093 0.130808 chr1:17853771:rs35255680 A C 144 0.131944 0.15 0.109375 1.42405 0.3563 0.992085 0.321156 chr1:18022097:rs694214 G T 140 0.435714 0.469512 0.387931 1.43869 0.2593 1.40253 0.160757

  VAR    Variant identifier
  REF    Reference allele
  ALT    Alternate allele(s)
  N      Number of individuals included in analysis
  F      Frequency of the alternate allele(s)
  F_A    For logistic models, case frequency
  F_U    For logistic models, control frequency
  OR     For logistic models, odds ratio
  BETA   For linear models, regression coefficient
  SE     Standard error of estimate
  STAT   Test statistic (Z or T)
  P      Asymptotic p-value

The coefficients for covariates are not displayed.

Gene/group-based association tests

To perform set-based association (where the set may be a gene, pathway, or any other entity), use the command:

pseq proj assoc --phenotype my.phenotype --mask loc.group=refseq

The assoc command currently assumes a dichotomous phenotype. By default, this command uses adaptive permutation. To use a fixed number per gene/region add:

--perm 10000

Other filters can be specified with the mask here, although a grouping factor must be present -- refseq in this case, which refers to a group in the LOCDB. Typically, given a focus on rare variants, the mask will filter on minor allele count or frequency:

pseq proj assoc --phenotype phe1 --mask loc.group=refseq mac=2-20
LOCUS      POS                         ALIAS               NVAR  TEST      P           I           DESC
NM_000021  chr14:72742931..72742931    CCDS9812.1,PSEN1    1     BURDEN    0.538462    0.153846    3/2
NM_000043  chr10:90752918..90758660    CCDS7393.1,FAS      3     BURDEN    0.6         0.2         8/9
NM_000055  chr3:166973982..167031223   CCDS3198.1,BCHE     6     BURDEN    0.833333    0.5         4/9
NM_000077  chr9:21960916..21960916     CCDS6510.1,CDKN2A   1     BURDEN    0.714286    0.285714    2/4
NM_000112  chr5:149337637..149341438   CCDS4300.1,SLC26A2  9     BURDEN    0.0555556   0.008547    14/4
NM_000113  chr9:131620904..131620904   CCDS6930.1,TOR1A    1     BURDEN    0.714286    0.714286    0/1
...

The NVAR field is the number of variants (meeting the mask). The default test is a simple comparison of the BURDEN of case to control minor alleles. The P field is based on permutation, the empirical significance. The I field indicates the proportion of null replicates for which the best test statistic was tied. That is, this is a measure of the discreteness of the empirical distribution of the test statistic, and effectively gives a sense of the smallest possible empirical p-value. In other words, for a gene with only a single singleton variant, in a sample of equal cases and controls, half the time you'd expect to see the variant in a case (under null permutations of the data) and half the time in a control. Thus I would be 0.5. Filtering out tests with very high I values is one way to achieve a set of tests for which the distribution of p-values will be more interpretable. This is comparable to removing SNPs with very low minor allele frequencies (e.g. under 1%) in GWAS.

There are a number of other gene-based tests available in PSEQ (and more being added). Currently available tests include, that are included with the --tests argument:

Output label Statistic keyword for --tests Directionality
BURDEN Excess of rare alleles in cases compared to controls no-burden (test included by default, use to remove) 1-sided
UNIQ Count of case-unique rare alleles uniq 1-sided
VT Variable threshold test, Price et al, 2010 vt 1-sided
FQRWGT Frequency-weighted test, in spirit of Madsen-Browning fw 1-sided
CALPHA C-alpha test, Neale et al, 2011 calpha 2-sided
SUMSTAT Sum of single-site statistics sumstat 2-sided

Multiple tests can be added at the same time:

--tests calpha vt

For the BURDEN test, the DESC field contains the number of case/control minor alleles. For UNIQ, it contains the number of case-only non-reference alleles in cases/controls. If the C-alpha test is performed, a frequency breakdown is given in DESC. This indicates the number of variants with a particular case/control count of the minor allele, e.g.

   0/1=3;1/0=2;29/25=1

means 3 control-only singletons, 2 case-only singletons and 1 common variant (29 case alleles, 25 control alleles) were observed.

Additional tests: A large number of other gene-based have and are being developed and we plan to implement a proportion of these. In collaboration with the authors of the respective packages, three tests that have been incorporated in the development version of PSEQ and will be released soon, pending a round of testing:

Collaboration? If you are the author of an appropriate test and would like to discuss ways to implement it within the PLINK/Seq framework, please contact us.

Permutation

Most association tests are currently based on permutation to assess empirical p-values, invoked either by default (for gene-based tests) or via the --perm option. A few important points to note:

Maintaining correlation between tests: where possible, for each entity tested (variant, gene, pathway) an identical set of permutations is performed (label-swapping of phenotype across individuals). As such, this will preserve the pattern of correlation in test statistics (due to LD) across sites.

Stratified permutation: if an apprpriate factor is present in the INDDB, one can performed stratified permutation (only label-swapping phenotype within strata) as follows:

--strata batch

(assuming that batch has been uploaded to the INDDB). Individuals who are missing for the stratifying variable will not be permuted.

Currently no support for family-data: permutation (and all currently implemented tests) are focused on samples of unrelated individuals. Development of family-based analyses and procedures will likely be an active area of future development.

Adaptive permutations: the number of permutations for each test can be specified:

--perm 10000

It is also possible to use adaptive permutations, in which a test is stopped early if it appears that the ultimate empirical p-value will not be highly significant:

--perm -1

Permutation and missing data: by default, permutation is performed without respect to missing genotype data. (By default, any individual with missing phenotype data will be removed prior to initiating the association test.) To permute phenotype labels only within individuals with no missing genotype data for that variant (or group of variants, in gene-based tests) add the argument:

--fix-null

Experiment-wide multiple-test correction: experiment-wide p-values can be calculated at the end of the analysis. By default this option is disabled in the current release.